Calculate the Similarity between two Arabic Strings

This Arabic version of the PHP similar_text() function compares two Arabic strings and measures how closely they match, both character by character and in their overall structure. Internally, it uses the Needleman-Wunsch algorithm, a well-known sequence alignment technique that assigns scores to matches, mismatches, and gaps (insertions or deletions). Three factors influence how characters are scored against each other: keyboard proximity (how close two characters are on a typical Arabic keyboard layout), graphical similarity (how much two characters share a visual shape), and phonetic similarity (whether two characters produce similar sounds).
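To illustrate the alignment step, here is a minimal sketch of Needleman-Wunsch scoring in PHP. It is not the library's implementation: the names splitChars() and alignmentScore() are hypothetical, and the flat scores (+1 for a match, -1 for a mismatch, -1 for a gap) are assumed stand-ins for the keyboard, graphical, and phonetic scoring described above.

```php
<?php
// Hypothetical helper: split a UTF-8 string into an array of characters.
function splitChars(string $text): array
{
    return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

// Global alignment score (Needleman-Wunsch) using assumed flat scores.
function alignmentScore(string $a, string $b, int $match = 1, int $mismatch = -1, int $gap = -1): int
{
    $x = splitChars($a);
    $y = splitChars($b);
    $n = count($x);
    $m = count($y);

    // First row: aligning nothing from $a with $j characters of $b needs $j gaps.
    $prev = [];
    for ($j = 0; $j <= $m; $j++) {
        $prev[$j] = $j * $gap;
    }

    for ($i = 1; $i <= $n; $i++) {
        // First column: $i characters of $a aligned against nothing.
        $curr = [$i * $gap];
        for ($j = 1; $j <= $m; $j++) {
            $diag = $prev[$j - 1] + ($x[$i - 1] === $y[$j - 1] ? $match : $mismatch);
            $up   = $prev[$j] + $gap;     // gap in $b
            $left = $curr[$j - 1] + $gap; // gap in $a
            $curr[$j] = max($diag, $up, $left);
        }
        $prev = $curr;
    }

    return $prev[$m];
}
```

With these assumed scores, two identical five-letter words score 5, and replacing one letter lowers the score to 3. A similarity-aware version would replace the flat mismatch penalty with a value that depends on how alike the two characters are.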

The similar_text() function combines these three factors into a single similarity measure for each pair of characters, and from there produces a total alignment score for the two strings. The function returns this score as a raw value and, through its third by-reference argument, as a percentage. The percentage expresses the result as a fraction of the maximum possible alignment score, giving an intuitive measure of how close the two strings are.
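The conversion from a raw score to a percentage can be sketched as follows. The function scoreToPercent() is hypothetical, and the definition of the maximum score (a perfect match on every character of the longer string) is an assumption; the library may compute its maximum differently.

```php
<?php
// Hypothetical conversion of a raw alignment score to a percentage.
// Assumption: the maximum possible score is a perfect match on every
// character of the longer string.
function scoreToPercent(float $score, int $longerLength, float $matchScore = 1.0): float
{
    $maxScore = $longerLength * $matchScore;
    if ($maxScore <= 0.0) {
        return 0.0; // nothing to compare against
    }

    // Clamp to the 0..100 range so negative alignment scores read as 0%.
    return max(0.0, min(100.0, 100.0 * $score / $maxScore));
}
```

For example, under these assumptions a raw score of 3 for a pair whose longer string has 5 characters corresponds to 60%.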

The comparison is also configurable. The setSimilarityWeight() method assigns relative importance to the keyboard, graphical, and phonetic similarities. To emphasize one factor (such as phonetic closeness), increase its weight; to de-emphasize or ignore a factor, reduce its weight, down to zero. This gives fine-grained control over how much each type of similarity influences the final measure, which makes the comparison adaptable to use cases ranging from spell-checking and autocorrection to natural language processing and search optimization.
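One way to picture the effect of the weights is a weighted average of the per-factor scores. The sketch below is an assumption about how such a combination could work, not the library's formula; combineSimilarity() and the factor keys are hypothetical names.

```php
<?php
// Hypothetical combiner: weighted average of per-factor scores in [0, 1].
// A factor with weight 0 does not influence the result.
function combineSimilarity(array $scores, array $weights): float
{
    $total = 0.0;
    $weightSum = 0.0;
    foreach ($weights as $factor => $weight) {
        $total += $weight * ($scores[$factor] ?? 0.0);
        $weightSum += $weight;
    }

    return $weightSum > 0 ? $total / $weightSum : 0.0;
}

// Made-up scores for one pair of characters.
$scores = ['keyboard' => 0.0, 'graphic' => 0.5, 'phonetic' => 1.0];

$equal = combineSimilarity($scores, ['keyboard' => 1, 'graphic' => 1, 'phonetic' => 1]);
$phoneticOnly = combineSimilarity($scores, ['keyboard' => 0, 'graphic' => 0, 'phonetic' => 1]);
```

With equal weights the three scores average to 0.5; with the keyboard and graphic weights set to zero, only the phonetic score (1.0) remains.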

Example Output 1:

Comparing مدرسة with the following words:

مَدرَسة has similarity 93%
مدرسه has similarity 88%
مدرصة has similarity 82%
ندرسه has similarity 68%
مُدَرّس has similarity 60%

Example Code 1:


<?php
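    $Arabic = new \ArPHP\I18N\Arabic();

    // Illustrative weights (assumed values, not necessarily the library
    // defaults): raise a weight to emphasize a factor, use 0 to ignore it.
    $keyboardWeight = 1.0;
    $graphicWeight  = 1.0;
    $phoneticWeight = 1.0;

    $Arabic->setSimilarityWeight('keyboardWeight', $keyboardWeight)
       ->setSimilarityWeight('graphicWeight', $graphicWeight)
       ->setSimilarityWeight('phoneticWeight', $phoneticWeight);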

    $correctWord = 'مدرسة';
    $candidateWords = ['مدرسه', 'مَدرَسة', 'ندرسه', 'مدرصة', 'مُدَرّس'];

    $percent = 0.0;
    $results = [];

    foreach ($candidateWords as $word) {
        $score = $Arabic->similar_text($correctWord, $word, $percent);
        $results[$word] = round($percent);
    }

    // Sort in descending order of similarity percentage
    arsort($results);

    echo "Comparing $correctWord with the following words:<ol>";
    foreach ($results as $candidate => $similarity) {
        echo "<li>$candidate has similarity $similarity%</li>";
    }
    echo '</ol>'; // close the list opened above
?>

Related Documentation: similar_text, setSimilarityWeight

Example Output 2:

Standard PHP version:
levenshtein("استقلال", "مستقل") = 6

Multibyte String version:
mb_levenshtein("استقلال", "مستقل") = 3

Example Code 2:


<?php
    /**
     * https://www.php.net/manual/en/function.levenshtein.php#113702
     *
     * Convert an UTF-8 encoded string to a single-byte string suitable for
     * functions such as levenshtein.
     * 
     * The function simply uses (and updates) a tailored dynamic encoding
     * (in/out map parameter) where non-ascii characters are remapped to
     * the range [128-255] in order of appearance.
     *
     * Thus it supports up to 128 different multibyte code points max over
     * the whole set of strings sharing this encoding.
     */
    function utf8_to_extended_ascii($str, &$map)
    {
        // find all multibyte characters (cf. utf-8 encoding specs)
        $matches = array();
        if (!preg_match_all('/[\xC0-\xF7][\x80-\xBF]+/', $str, $matches))
            return $str; // plain ascii string
        
        // update the encoding map with the characters not already met
        foreach ($matches[0] as $mbc)
            if (!isset($map[$mbc]))
                $map[$mbc] = chr(128 + count($map));
        
        // finally remap non-ascii characters
        return strtr($str, $map);
    }

    /*
     * Didactic example showing the usage of the previous conversion function but,
     * for better performance, in a real application with a single input string
     * matched against many strings from a database, you will probably want to
     * pre-encode the input only once.
     */
    function mb_levenshtein($string1, $string2, $insertion_cost = 1, $replacement_cost = 1, $deletion_cost = 1)
    {
        $charMap = array();
        $string1 = utf8_to_extended_ascii($string1, $charMap);
        $string2 = utf8_to_extended_ascii($string2, $charMap);

        return levenshtein($string1, $string2, $insertion_cost, $replacement_cost, $deletion_cost);
    }

    echo '<p><b>Standard PHP version:</b><br/> levenshtein("استقلال", "مستقل") = ';
    echo levenshtein("استقلال", "مستقل") . '</p>';

    echo '<p><b>Multibyte String version:</b><br/> mb_levenshtein("استقلال", "مستقل") = ';
    echo mb_levenshtein("استقلال", "مستقل") . '</p>';
?>
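
The remapping approach above supports at most 128 distinct non-ASCII characters across all strings that share one map. As an alternative sketch, the distance can be computed directly on arrays of UTF-8 characters, which avoids that limit. The function charLevenshtein() is a hypothetical name, and it uses unit costs for insertion, deletion, and substitution.

```php
<?php
// Hypothetical alternative: Levenshtein distance computed on arrays of
// UTF-8 characters, so there is no 128-character mapping limit.
function charLevenshtein(string $a, string $b): int
{
    $x = preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $y = preg_split('//u', $b, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $m = count($y);

    // Distance from the empty prefix of $a to each prefix of $b.
    $prev = range(0, $m);

    foreach ($x as $i => $charA) {
        // Distance from the first $i + 1 characters of $a to the empty string.
        $curr = [$i + 1];
        for ($j = 1; $j <= $m; $j++) {
            $cost = ($charA === $y[$j - 1]) ? 0 : 1;
            $curr[$j] = min(
                $prev[$j] + 1,        // deletion
                $curr[$j - 1] + 1,    // insertion
                $prev[$j - 1] + $cost // substitution or match
            );
        }
        $prev = $curr;
    }

    return $prev[$m];
}

echo charLevenshtein("استقلال", "مستقل"), PHP_EOL; // prints 3
```

This version trades the speed of the built-in levenshtein() for independence from the size of the character set, so it is better suited to occasional comparisons than to bulk matching.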