Source for file Identifier.php
* ----------------------------------------------------------------------
* Copyright (c) 2006-2016 Khaled Al-Sham'aa.
* ----------------------------------------------------------------------
* This program is open source product; you can redistribute it and/or
* modify it under the terms of the GNU Lesser General Public License (LGPL)
* as published by the Free Software Foundation; either version 3
* of the License, or (at your option) any later version.
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU Lesser General Public License for more details.
* You should have received a copy of the GNU Lesser General Public License
* along with this program. If not, see <http://www.gnu.org/licenses/lgpl.txt>.
* ----------------------------------------------------------------------
* Class Name: Identify Arabic Text Segments
* Filename: Identifier.php
* Original Author(s): Khaled Al-Sham'aa <khaled@ar-php.org>
* Purpose: This class identifies Arabic text in a given UTF-8
* multilingual document and returns an array of start and end
* positions for the Arabic text segments.
* ----------------------------------------------------------------------
* Identify Arabic Text Segments
* Using this PHP class, you can take a fully automated approach to processing
* Arabic text by quickly and accurately determining Arabic text segments within
* multilingual documents.
* Understanding the language and encoding of a given document is an essential step
* in working with unstructured multilingual text. Without this basic knowledge,
* applications such as information retrieval and text mining cannot accurately
* process data and important information may be completely missed or mis-routed.
* Any application that works with Arabic in multilingual documents can
* benefit from the ArIdentifier class. Using this class, applications can take a
* fully automated approach to processing Arabic text by quickly and accurately
* determining Arabic text segments within a multilingual document.
* include('./I18N/Arabic.php');
* $obj = new I18N_Arabic('Identifier');
* $hStr = $obj->highlightText($str, '#80B020');
* echo $str . '<hr />' . $hStr . '<hr />';
* $taggedText = $obj->tagText($str);
* foreach ($taggedText as $wordTag) {
*     list($word, $tag) = $wordTag;
*     // Assumption: a tag of 1 marks a noun; the condition is missing from this listing.
*     if ($tag == 1) {
*         echo "$word is Noun, ";
*     } else {
*         echo "$word is not Noun, ";
*     }
* }
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
* @copyright 2006-2016 Khaled Al-Sham'aa
* @license LGPL <http://www.gnu.org/licenses/lgpl.txt>
* @link http://www.ar-php.org
* This PHP class identifies Arabic text segments
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
* @copyright 2006-2016 Khaled Al-Sham'aa
* @license LGPL <http://www.gnu.org/licenses/lgpl.txt>
* @link http://www.ar-php.org
* Loads initial values
public function __construct()
* Identify Arabic text in a given UTF-8 multilingual string
* @param string $str UTF-8 multilingual string
* @return array Byte offsets of the beginning and end of each Arabic segment,
* in sequence, in the given UTF-8 multilingual string
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
// ignore ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 :
// if they come in an Arabic context
if ($cDec >= 33 && $cDec <= 58) {
if (!$probAr && ($cDec == 216 || $cDec == 217)) {
$pDec = ord($str[$i - 1]);
$utfDecCode = ($pDec << 8) + $cDec;
if ($utfDecCode >= $minAr && $utfDecCode <= $maxAr) {
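The fragments above test whether a lead byte of 216 or 217 (0xD8 or 0xD9) and the byte that follows it form a two-byte UTF-8 sequence inside an Arabic range. The following is a minimal, self-contained sketch of that test only, not the class's actual method. The function name `isArabicPair` and the two bounds are assumptions: the values of `$minAr` and `$maxAr` do not appear in this listing, so the sketch uses the UTF-8 encodings of U+060C and U+0652 for illustration.

```php
<?php
// Assumed bounds (not shown in the listing): 0xD88C is the UTF-8
// encoding of U+060C and 0xD992 is the UTF-8 encoding of U+0652.
const MIN_AR = 0xD88C; // 55436
const MAX_AR = 0xD992; // 55698

/**
 * Hypothetical helper: return true if the bytes at offsets $i - 1 and $i
 * form a two-byte UTF-8 sequence inside the assumed Arabic range.
 */
function isArabicPair(string $str, int $i): bool
{
    if ($i < 1 || $i >= strlen($str)) {
        return false;
    }

    $pDec = ord($str[$i - 1]); // candidate lead byte
    $cDec = ord($str[$i]);     // candidate continuation byte

    // Only lead bytes 0xD8 (216) and 0xD9 (217) are considered.
    if ($pDec !== 216 && $pDec !== 217) {
        return false;
    }

    // Pack both bytes into one integer so a single range check suffices.
    $utfDecCode = ($pDec << 8) + $cDec;

    return $utfDecCode >= MIN_AR && $utfDecCode <= MAX_AR;
}
```

Packing the two bytes into one integer avoids decoding the string to code points, so the scan can stay byte-oriented.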
* Find out if given string is Arabic text or not
* @param string $str String
* @return boolean True if the given string is UTF-8 Arabic, false otherwise
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
$arr = self::identify($str);
if (count($arr) == 1 && $arr[0] == 0) {
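The check above reads the offset list returned by `identify`: a single entry equal to 0 suggests that an Arabic segment opens at the first byte and is never closed, so the whole string is Arabic. The sketch below shows one way a caller might work with such a flat list. Both helpers are hypothetical and are not part of the class; closing a trailing, unpaired start offset at the string length is an assumption made for illustration.

```php
<?php
/**
 * Hypothetical helper: group a flat [start, end, start, end, ...] offset
 * list into [start, end] pairs. A trailing start without an end is closed
 * at $length (assumption: that segment runs to the end of the string).
 */
function pairOffsets(array $offsets, int $length): array
{
    $pairs = [];
    $count = count($offsets);

    for ($i = 0; $i < $count; $i += 2) {
        $end     = $offsets[$i + 1] ?? $length;
        $pairs[] = [$offsets[$i], $end];
    }

    return $pairs;
}

/**
 * Mirrors the condition in the listing: exactly one offset, equal to 0.
 */
function coversWholeString(array $offsets): bool
{
    return count($offsets) === 1 && $offsets[0] === 0;
}
```

With pairs in hand, a caller could extract each segment with `substr($str, $start, $end - $start)`, since the offsets index bytes, not characters.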