Source for file AutoSummarize.php
* ----------------------------------------------------------------------
* Copyright (c) 2006-2016 Khaled Al-Sham'aa.
* ----------------------------------------------------------------------
* This program is an open source product; you can redistribute it and/or
* modify it under the terms of the GNU Lesser General Public License (LGPL)
* as published by the Free Software Foundation; either version 3
* of the License, or (at your option) any later version.
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU Lesser General Public License for more details.
* You should have received a copy of the GNU Lesser General Public License
* along with this program. If not, see <http://www.gnu.org/licenses/lgpl.txt>.
* ----------------------------------------------------------------------
* Class Name: Arabic Auto Summarize Class
* Filename: AutoSummarize.php
* Original Author(s): Khaled Al-Sham'aa <khaled@ar-php.org>
* Purpose: Automatic keyphrase extraction to provide a quick mini-summary
* for a long Arabic document.
* ----------------------------------------------------------------------
* This class identifies the key points in an Arabic document for you to share with
* others or quickly scan. The class determines key points by analyzing an Arabic
* document and assigning a score to each sentence. Sentences that contain words
* used frequently in the document are given a higher score. You can then choose a
* percentage of the highest-scoring sentences to display in the summary.
* "ArAutoSummarize" class works best on well-structured documents such as reports,
* articles, and scientific papers.
* "ArAutoSummarize" class cuts wordy copy to the bone by counting words and ranking
* sentences. First, "ArAutoSummarize" class identifies the most common words in the
* document and assigns a "score" to each word--the more frequently a word is used,
* the higher its score. Then, it "averages" each sentence by adding the scores of
* its words and dividing the sum by the number of words in the sentence--the higher
* the average, the higher the rank of the sentence. "ArAutoSummarize" class can
* summarize texts to a specific number of sentences or a percentage of the original
* copy.
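The word-scoring and sentence-averaging steps described above can be illustrated with a minimal sketch. The function names `sketchWordScores` and `sketchSentenceScore` are hypothetical and are not part of this class; the sketch skips normalization, stop-word removal, and the location and important-word weighting that the class applies, and it uses a short English sample only to keep the arithmetic easy to follow.

```php
<?php
// Hypothetical helpers for illustration only; not the class's actual code.

/** Score each word by how often it occurs in the text. */
function sketchWordScores(string $text): array
{
    $words = preg_split('/[\s.,;،؛]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    return array_count_values($words);
}

/** Average the scores of the words in one sentence. */
function sketchSentenceScore(string $sentence, array $scores): float
{
    $words = preg_split('/[\s.,;،؛]+/u', $sentence, -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) === 0) {
        return 0.0;
    }
    $sum = 0;
    foreach ($words as $word) {
        $sum += $scores[$word] ?? 0;
    }
    return $sum / count($words);
}

$text   = 'cats chase mice. cats sleep. dogs bark.';
$scores = sketchWordScores($text);

$ranked = array();
foreach (preg_split('/\.\s*/u', $text, -1, PREG_SPLIT_NO_EMPTY) as $sentence) {
    $ranked[$sentence] = sketchSentenceScore($sentence, $scores);
}

// Highest average first: "cats sleep" scores (2 + 1) / 2 = 1.5
arsort($ranked, SORT_NUMERIC);
```

Averaging rather than summing keeps long sentences from winning on length alone, which is presumably why the description divides by the number of words.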
* We use a statistical approach, with some attention paid to:
* - Location: leading sentences of paragraph, title, introduction, and conclusion.
* - Fixed phrases: in-text summaries.
* - Frequencies of words, phrases, proper names
* - Contextual material: query, title, headline, initial paragraph
* The motivation for this class is the range of applications for key phrases:
* - Mini-summary: Automatic key phrase extraction can provide a quick mini-summary
* for a long document. For example, it could be a feature in a web site; just
* click the summarize button when browsing a long web page.
* - Highlights: It can highlight key phrases in a long document, to facilitate
* skimming the document.
* - Author Assistance: Automatic key phrase extraction can help an author or editor
* who wants to supply a list of key phrases for a document. For example, the
* administrator of a web site might want to have a key phrase list at the top of
* each web page. The automatically extracted phrases can be a starting point for
* further manual refinement by the author or editor.
* - Text Compression: On a device with limited display capacity or limited
* bandwidth, key phrases can be a substitute for the full text. For example, an
* email message could be reduced to a set of key phrases for display on a pager;
* a web page could be reduced for display on a portable wireless web browser.
* This list is not intended to be exhaustive, and there may be some overlap in
* the items.
* Example:
* include('./I18N/Arabic.php');
* $obj = new I18N_Arabic('AutoSummarize');
* $file = 'Examples/Articles/Ajax.txt';
* // keyword count and summary rate; the value 20 is an assumed example
* $r = 20;
* // get contents of a file into a string
* $fhandle = fopen($file, "r");
* $c = fread($fhandle, filesize($file));
* fclose($fhandle);
* $k = $obj->getMetaKeywords($c, $r);
* echo '<b><font color=#FFFF00>';
* echo 'Keywords:</font></b>';
* echo '<p dir="rtl" align="justify">';
* echo $k . '</p>';
* $s = $obj->doRateSummarize($c, $r);
* echo '<b><font color=#FFFF00>';
* echo 'Summary:</font></b>';
* echo '<p dir="rtl" align="justify">';
* echo $s . '</p>';
* echo '<b><font color=#FFFF00>';
* echo 'Full Text:</font></b>';
* echo '<p><a class=ar_link target=_blank ';
* echo 'href='.$file.'>Source File</a></p>';
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
* @copyright 2006-2016 Khaled Al-Sham'aa
* @license LGPL <http://www.gnu.org/licenses/lgpl.txt>
* @link http://www.ar-php.org
* This PHP class does automatic keyphrase extraction to provide a quick
* mini-summary for a long Arabic document.
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
* @copyright 2006-2016 Khaled Al-Sham'aa
* @license LGPL <http://www.gnu.org/licenses/lgpl.txt>
* @link http://www.ar-php.org
private $_normalizeAlef = array('أ','إ','آ');
private $_normalizeDiacritics = array('َ','ً','ُ','ٌ','ِ','ٍ','ْ','ّ');
private $_commonChars = array('ة','ه','ي','ن','و','ت','ل','ا','س','م',
'e', 't', 'a', 'o', 'i', 'n', 's');
private $_separators = array('.',"\n",'،','؛','(','[','{',')',']','}',',',';');
private $_commonWords = array();
private $_importantWords = array();
* Loads initial values
public function __construct()
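// These common words are used in the cleanCommon method
$words = file(dirname(__FILE__ ). '/data/ar-stopwords.txt');
$en_words = file(dirname(__FILE__ ). '/data/en-stopwords.txt');
// Merge both stop-word lists and strip the line endings left by file()
$words = array_merge($words, $en_words);
$words = array_map('trim', $words);
$this->_commonWords = $words;
// These important words are used in the rankSentences method
$words = file(dirname(__FILE__ ). '/data/important-words.txt');
$words = array_map('trim', $words);
$this->_importantWords = $words;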
* Load enhanced Arabic stop words list
$extra_words = file(dirname(__FILE__ ). '/data/ar-extra-stopwords.txt');
$extra_words = array_map('trim', $extra_words);
$this->_commonWords = array_merge($this->_commonWords, $extra_words);
* Core summarize function that implements the required steps of the algorithm
* @param string $str Input Arabic document as a string
* @param string $keywords List of keywords highlighted by search process
* @param integer $int Sentences value (see $mode effect also)
* @param string $mode Mode of sentences count [number|rate]
* @param string $output Output mode [summary|highlight]
* @param string $style Name of the CSS class you would like to apply
* @return string Output summary requested
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
protected function summarize($str, $keywords, $int, $mode, $output, $style= null)
"/[^\.\n\،\؛\,\;](.+?)[\.\n\،\؛\,\;]/u",
$_sentences = $sentences[0];
$totalSentences = count($_sentences);
$maxChars = round($int * $totalChars / 100);
$int = round($int * $totalSentences / 100);
"/[^\.\n\،\؛\,\;](.+?)[\.\n\،\؛\,\;]/u",
$_stemmedSentences = $sentences[0];
foreach ($words as $word) {
$wordRanks[$word] = 1000;
list ($sentences, $ranks) = $sentencesRanks;
$totalSentences = count($ranks);
for ($i = 0; $i < $totalSentences; $i++ ) {
if ($sentencesRanks[1][$i] >= $minRank) {
if ($output == 'summary') {
$summary .= ' '. $sentencesRanks[0][$i];
$summary .= '<span class="' . $style . '">' .
$sentencesRanks[0][$i] . '</span>';
if ($output == 'highlight') {
$summary .= $sentencesRanks[0][$i];
if ($output == 'highlight') {
* Summarize input Arabic string (document content) into a specific number of
* sentences in the output
* @param string $str Input Arabic document as a string
* @param integer $int Number of sentences required in output summary
* @param string $keywords List of keywords highlighted by search process
* @return string Output summary requested
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
$str, $keywords, $int, 'number', 'summary', $style
* Summarize the input Arabic string (document content) down to a percentage of
* its sentences in the output
* @param string $str Input Arabic document as a string
* @param integer $rate Rate of output summary sentence number as
* percentage of the input Arabic string
* @param string $keywords List of keywords highlighted by search process
* @return string Output summary requested
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
$str, $keywords, $rate, 'rate', 'summary', $style
* Highlight key sentences (summary) of the input string (document content)
* using CSS and send the result back as an output
* @param string $str Input Arabic document as a string
* @param integer $int Number of key sentences required to be
* highlighted in the input string
* @param string $keywords List of keywords highlighted by search process
* @param string $style Name of the CSS class you would like to apply
* @return string Output highlighted key sentences summary (using CSS)
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
$str, $keywords, $int, 'number', 'highlight', $style
* Highlight key sentences (summary) as percentage of the input string
* (document content) using CSS and send the result back as an output.
* @param string $str Input Arabic document as a string
* @param integer $rate Rate of highlighted key sentences summary
* number as percentage of the input Arabic
* string (document content)
* @param string $keywords List of keywords highlighted by search process
* @param string $style Name of the CSS class you would like to apply
* @return string Output highlighted key sentences summary (using CSS)
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
$str, $keywords, $rate, 'rate', 'highlight', $style
* Extract keywords from a given Arabic string (document content)
* @param string $str Input Arabic document as a string
* @param integer $int Number of keywords to be extracted
* from input string (document content)
* @return string List of the keywords extracted from input Arabic string
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
array_push($patterns, '/\.|\n|\،|\؛|\(|\[|\{|\)|\]|\}|\,|\;/u');
$str = preg_replace('/(\W)ال(\w{3,})/u', '\\1\\2', $cleanedStr);
arsort($wordRanks, SORT_NUMERIC);
foreach ($wordRanks as $key => $value) {
$metaKeywords .= $key . '، ';
$metaKeywords = mb_substr($metaKeywords, 0, - 2);
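The final steps visible above (sort the word ranks in descending order, join the top words with the Arabic comma, drop the trailing separator) can be sketched as follows. `sketchTopKeywords` is a hypothetical name, and using `implode` in place of string concatenation followed by `mb_substr` is a substitution made here for brevity, not the method's actual code.

```php
<?php
// Hypothetical helper for illustration only; not the class's actual code.

/** Return the $count highest-ranked words joined by an Arabic comma. */
function sketchTopKeywords(array $wordRanks, int $count): string
{
    // Highest rank first; keys (the words) are preserved.
    arsort($wordRanks, SORT_NUMERIC);

    $top = array_slice(array_keys($wordRanks), 0, $count);

    // U+060C is the Arabic comma used as the separator in the listing above.
    // implode() places it only between items, so no trailing trim is needed.
    return implode("\u{060C} ", $top);
}

$keywords = sketchTopKeywords(array('ajax' => 5, 'web' => 9, 'php' => 7), 2);
// $keywords is "web، php"
```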
* Normalize Arabic document
* @param string $str Input Arabic document as a string
* @return string Normalized Arabic document
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
$str = str_replace($this->_normalizeDiacritics, '', $str);
'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
'abcdefghijklmnopqrstuvwxyz'
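A minimal standalone sketch of the normalization step, built from the `$_normalizeAlef` and `$_normalizeDiacritics` lists declared earlier in this file. `sketchNormalize` is a hypothetical name, and the replacement of each Alef variant with a bare Alef is an assumption, since that line is not visible in this listing. The characters are written as Unicode escapes (PHP 7 or later) so the combining marks stay unambiguous.

```php
<?php
// Hypothetical standalone function for illustration; not the class's actual code.

function sketchNormalize(string $str): string
{
    // Alef with hamza above, hamza below, and madda (as in $_normalizeAlef).
    $alefForms = array("\u{0623}", "\u{0625}", "\u{0622}");

    // Fatha, fathatan, damma, dammatan, kasra, kasratan, sukun, shadda
    // (as in $_normalizeDiacritics).
    $diacritics = array(
        "\u{064E}", "\u{064B}", "\u{064F}", "\u{064C}",
        "\u{0650}", "\u{064D}", "\u{0652}", "\u{0651}",
    );

    // Assumption: every Alef variant is folded to a bare Alef (U+0627).
    $str = str_replace($alefForms, "\u{0627}", $str);
    $str = str_replace($diacritics, '', $str);

    // Lowercase ASCII letters only; multi-byte UTF-8 bytes are left untouched.
    return strtr($str, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz');
}

// A vocalized Arabic word followed by a Latin word.
$input      = "\u{0623}\u{064E}\u{062D}\u{0652}\u{0645}\u{064E}\u{062F} AJAX";
$normalized = sketchNormalize($input);
```

Folding spelling variants this way means that differently vocalized forms of the same word are counted together when word frequencies are computed.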
* Remove common Arabic words (roughly)
* from input Arabic string (document content)
* @param string $str Input normalized Arabic document as a string
* @return string Arabic document as a string free of common words (roughly)
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
* Remove less significant Arabic letters from given string (document content).
* Please note that output will not be human readable.
* @param string $str Input Arabic document as a string
* @return string Output string after removing less significant Arabic letters
* (not human readable output)
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
* Ranks words in a given Arabic string (document content). The rank of a word
* is the number of times that word appears in the document.
* @param string $str Input Arabic document as a string
* @return hash Associative array keyed by document word, where each value is
* the rank of that word.
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
foreach ($words as $word) {
if (isset ($wordsRanks[$word])) {
foreach ($wordsRanks as $wordRank => $total) {
if (isset ($wordsRanks[$subWordRank])) {
unset ($wordsRanks[$wordRank]);
$wordsRanks[$subWordRank] += $total;
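The two loops above can be sketched as a standalone function: count each word, then fold a prefixed form into its base form when the base form also occurs. How `$subWordRank` is derived is not visible in this listing; the sketch assumes it is the word with a leading conjunction و (U+0648) removed. `sketchRankWords` is a hypothetical name.

```php
<?php
// Hypothetical standalone function for illustration; not the class's actual code.

function sketchRankWords(string $str): array
{
    $words = preg_split('/[\s,]+/u', $str, -1, PREG_SPLIT_NO_EMPTY);

    // First pass: plain frequency count.
    $ranks = array();
    foreach ($words as $word) {
        if (isset($ranks[$word])) {
            $ranks[$word]++;
        } else {
            $ranks[$word] = 1;
        }
    }

    // Second pass (assumed rule): if a word starts with the conjunction
    // U+0648 and the remainder is itself a counted word, merge the counts.
    // foreach iterates over a copy, so modifying $ranks here is safe.
    foreach ($ranks as $word => $total) {
        if (preg_match('/^\x{0648}(.+)$/u', (string) $word, $m)
            && isset($ranks[$m[1]])
        ) {
            unset($ranks[$word]);
            $ranks[$m[1]] += $total;
        }
    }

    return $ranks;
}

$book  = "\u{0643}\u{062A}\u{0627}\u{0628}";   // "book"
$pen   = "\u{0642}\u{0644}\u{0645}";           // "pen"
$ranks = sketchRankWords($book . " \u{0648}" . $book . ' ' . $pen);
```

Requiring the base form to exist in the document avoids stripping the first letter from words that merely begin with that letter.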
* Ranks sentences in a given Arabic string (document content).
* @param array $sentences Sentences of the input Arabic document
* @param array $stemmedSentences Stemmed sentences of the input Arabic
* document
* @param array $arr Word ranks array (word as the index, word
* frequency as the value)
* @return array Two-dimensional array: the first item is an array of document
* sentences, the second item is an array of the ranks of those
* sentences
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
protected function rankSentences($sentences, $stemmedSentences, $arr)
$max = count($sentences);
for ($i = 0; $i < $max; $i++ ) {
$sentence = $sentences[$i];
} elseif (in_array($first, $this->_separators)) {
} elseif (in_array($last, $this->_separators)) {
foreach ($this->_importantWords as $word) {
if (!in_array($first, $this->_separators)) {
$sentence = $first . $sentence;
$stemStr = $stemmedSentences[$i];
$totalWords = count($words);
foreach ($words as $word) {
if (isset ($arr[$word])) {
$totalWordsRank += $arr[$word];
$wordsRank = $totalWordsRank / $totalWords;
$sentenceRanks = $w * $wordsRank;
$sentencesRanks = array($sentenceArr, $rankArr);
* Calculate the minimum rank for sentences to be included in the summary
* @param array $str Document sentences
* @param array $arr Sentences ranks
* @param integer $int Number of sentences you need to include in your summary
* @param integer $max Maximum number of characters accepted in your summary
* @return integer Minimum accepted sentence rank (sentences with this rank
* or higher will be listed in the document summary)
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
foreach ($str as $line) {
rsort($arr, SORT_NUMERIC);
for ($i= 0; $i<= $int; $i++ ) {
if ($totalChars >= $max) {
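The idea behind the threshold can be sketched as follows: walk the sentences from highest rank to lowest and stop once either the requested sentence count or the character budget is reached; the rank of the last sentence taken becomes the cut-off. This is a simplified variant and not the method's actual code: it keeps each sentence length paired with its rank via `arsort`, and `sketchMinAcceptedRank` is a hypothetical name. It assumes the mbstring extension is available for `mb_strlen`.

```php
<?php
// Hypothetical simplified variant for illustration; not the class's actual code.

function sketchMinAcceptedRank(array $sentences, array $ranks, int $count, int $maxChars): float
{
    // Highest rank first; keys still point at the matching sentence.
    arsort($ranks, SORT_NUMERIC);

    $taken   = 0;
    $chars   = 0;
    $minRank = 0.0;

    foreach ($ranks as $i => $rank) {
        if ($taken >= $count) {
            break;                      // enough sentences selected
        }
        $minRank = (float) $rank;
        $taken++;
        $chars += mb_strlen($sentences[$i], 'UTF-8');
        if ($chars >= $maxChars) {
            break;                      // character budget used up
        }
    }

    return $minRank;
}

$sentences = array('aaaa', 'bb', 'cccccc');
$ranks     = array(1.0, 3.0, 2.0);

$byCount  = sketchMinAcceptedRank($sentences, $ranks, 2, 100); // 2.0
$byBudget = sketchMinAcceptedRank($sentences, $ranks, 2, 2);   // 3.0
```

A caller would then keep every sentence whose rank is greater than or equal to the returned value, matching the `>=` comparison used in the summarize method.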
* Check some conditions to know if a given string is a formal valid word or not
* @param string $word String to be checked if it is a valid word or not
* @return boolean True if passed string is accepted as a valid word, else False
* @author Khaled Al-Sham'aa <khaled@ar-php.org>