Arabic SQL Query

Arabic Case Insensitive In Database Systems

1. Use Regular Expressions in Arabic SQL Queries:

Build the WHERE condition of an SQL statement using MySQL REGEXP and Arabic lexical rules.

With the exception of the Qur'an and pedagogical texts, Arabic is generally written without vowels or other graphic symbols that indicate how a word is pronounced. The reader is expected to fill these in from context. Some of the graphic symbols include sukuun, which is placed over a consonant to indicate that it is not followed by a vowel; shadda, written over a consonant to indicate it is doubled; and hamza, the sign of the glottal stop, which can be written above or below (alif) at the beginning of a word, or on (alif), (waaw), (yaa'), or by itself on the line elsewhere. Also, common spelling differences regularly appear, including the use of (haa') for (taa' marbuuta) and (alif maqsuura) for (yaa'). These features of written Arabic, which are also seen in Hebrew as well as other languages written with Arabic script (such as Farsi, Pashto, and Urdu), make analyzing and searching texts quite challenging. In addition, Arabic morphology and grammar are quite rich and present some unique issues for information retrieval applications.

There are essentially three ways to search an Arabic text with Arabic queries: literal, stem-based or root-based.

A literal search, the simplest search and retrieval method, matches documents based on the search terms exactly as the user entered them. The advantage of this technique is that the documents returned will without a doubt contain the exact term for which the user is looking. But this advantage is also the biggest disadvantage: many, if not most, of the documents containing the terms in different forms will be missed. Given the many ambiguities of written Arabic, the success rate of this method is quite low. For example, if the user searches for (kitaab, book), he or she will not find documents that only contain (al-kitaabu, the book).
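The limitation of literal matching can be sketched in a few lines of PHP. This is only an illustration, assuming whole-word comparison; the helper name `literalMatch` and the sample sentences are hypothetical and not part of the library.

```php
<?php
// Hypothetical helper: whole-word literal search with no normalization.
function literalMatch(string $term, string $document): bool
{
    // Split on whitespace (UTF-8 aware) and compare each token exactly.
    $tokens = preg_split('/\s+/u', trim($document), -1, PREG_SPLIT_NO_EMPTY);
    return in_array($term, $tokens, true);
}

var_dump(literalMatch('كتاب', 'هذا كتاب جديد')); // bool(true)
var_dump(literalMatch('كتاب', 'قرأت الكتاب'));   // bool(false): the article is attached
```

The second call fails even though the document is clearly about a book, which is the kind of miss the paragraph describes.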

Stem-based searching, a more complicated method, requires some normalization of the original texts and the queries. This is done by removing the vowel signs, unifying the hamza forms and removing or standardizing the other signs. Additionally, grammatical affixes and other constructions which attach directly to words, such as conjunctions, prepositions, and the definite article, should be identified and removed. Finally, regular and irregular plural forms need to be identified and reduced to their singular forms. Performing this type of stemming leads to more successful searches, but can be problematic due to over-generation or incorrect generation of stems.
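A minimal sketch of the normalization and prefix-stripping steps might look like the following. The function names and the exact rule set are assumptions made for illustration; plural reduction is not attempted, and this is not the library's own implementation.

```php
<?php
// Hypothetical light normalizer; the rules are illustrative only.
function normalizeArabic(string $text): string
{
    // Remove tanween, short vowels, shadda, sukuun (U+064B..U+0652) and tatweel (U+0640).
    $text = preg_replace('/[\x{064B}-\x{0652}\x{0640}]/u', '', $text);
    // Unify alef carrying hamza or madda with bare alef.
    $text = preg_replace('/[أإآ]/u', 'ا', $text);
    // Common spelling variants: taa' marbuuta -> haa', alif maqsuura -> yaa'.
    return str_replace(['ة', 'ى'], ['ه', 'ي'], $text);
}

// Hypothetical light stemmer: strips an optional one-letter conjunction or
// preposition plus the definite article, if at least three letters remain.
function lightStem(string $word): string
{
    $word = normalizeArabic($word);
    return preg_replace('/^[وفبك]?ال(?=.{3,}$)/u', '', $word);
}

echo lightStem('الكِتَابُ'), "\n";  // كتاب
echo lightStem('وَالكِتَاب'), "\n"; // كتاب
echo lightStem('ألبانيا'), "\n";   // بانيا (Albania wrongly loses its first two letters)
```

The last line shows the incorrect stem generation mentioned above: a word that merely begins with the same letters as the definite article is truncated.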

A third method for searching Arabic texts is to index and search for the root forms of each word. Since most verbs and nouns in Arabic are derived from triliteral (or, rarely, quadriliteral) roots, identifying the underlying root of each word theoretically retrieves most of the documents containing a given search term regardless of form. However, there are some significant challenges with this approach. Determining the root for a given word is extremely difficult, since it requires a detailed morphological, syntactic and semantic analysis of the text to fully disambiguate the root forms. The issue is complicated further by the fact that not all words are derived from roots. Loan words (words borrowed from another language), for instance, are generally not based on root forms, although even this rule has exceptions: some loans whose structure resembles a triliteral root, such as the English word film, are handled grammatically as if they were root-based, adding to the complexity of this type of search. Finally, a single root can serve as the foundation for a wide variety of words with related meanings. The root (k-t-b) is used for many words related to writing, including (kataba, to write), (kitaab, book), (maktab, office), and (kaatib, author), but the same root also yields (katiiba), regiment/battalion. As a result, searching based on root forms results in very high recall, but precision is usually quite low.
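The recall and precision trade-off can be illustrated with a toy root index. The table below is hand-built from the k-t-b words mentioned in this paragraph, and the names `ROOT_OF` and `searchByRoot` and the sample documents are hypothetical; a real system would need a morphological analyzer to derive roots.

```php
<?php
// Hypothetical hand-built word => root table (illustration only).
const ROOT_OF = [
    'كتب'   => 'كتب', // kataba, to write
    'كتاب'  => 'كتب', // kitaab, book
    'مكتب'  => 'كتب', // maktab, office
    'كاتب'  => 'كتب', // kaatib, author
    'كتيبة' => 'كتب', // katiiba, regiment/battalion
];

// Return every document containing a word that shares the query's root.
function searchByRoot(string $query, array $documents): array
{
    $root = ROOT_OF[$query] ?? null;
    if ($root === null) {
        return []; // root unknown: nothing to match on
    }
    return array_values(array_filter($documents, function (string $doc) use ($root): bool {
        foreach (preg_split('/\s+/u', trim($doc), -1, PREG_SPLIT_NO_EMPTY) as $word) {
            if ((ROOT_OF[$word] ?? null) === $root) {
                return true;
            }
        }
        return false;
    }));
}

$docs = ['كتاب جديد', 'مكتب كبير', 'كتيبة مشاة', 'بيت قديم'];
$hits = searchByRoot('كتاب', $docs);
echo count($hits), "\n"; // 3
```

A query for "book" returns the documents about an office and a battalion as well, which is high recall at the cost of precision.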

While search and retrieval of Arabic text will never be an easy task, relying on linguistic analysis tools and methods can help make the process more successful. Ultimately, the search method you choose should depend on how critical it is to retrieve every conceivable instance of a word or phrase and the resources you have to process search returns in order to determine their true relevance.

Reference: Volume 13 Issue 7 of MultiLingual Computing & Technology published by MultiLingual Computing, Inc., 319 North First Ave., Sandpoint, Idaho, USA, 208-263-8178, Fax: 208-263-6310.

2. Add a Normalized Field(s) to the Database Table(s):

This solution is independent of the database system, so it should keep working even if you change the DBS for any reason. However, it requires adding one or more columns to your tables and processing some data. The idea is simple: add a new column, fill it with the Arabic text in a "normalized form", then use the normalized column in your queries.

You can use the setNorm and arNormalizeText functions to perform this task. The next step is to add a new column to your table and fill it with the normalized version of your Arabic text content. Once the normalized data is in place, how do we use it to solve the problem?

When the user searches for something, we first pass the search text to the same normalize function, which returns its normalized version. We then query the normalized column and display the content of the original column in the search results. In short: add the normalized field to the table, pass the search string through the normalize function, search on the normalized value, and display the original value. This is a more modular solution, but it requires more work.
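The flow can be sketched without a database as follows. Everything here is an assumption made for illustration: `normalizeForSearch` is a simplified stand-in for the library's normalization, the `name` and `name_norm` keys play the role of the original and normalized columns, and the array stands in for the table.

```php
<?php
// Simplified stand-in normalizer (an assumption, not the library function).
function normalizeForSearch(string $text): string
{
    // Strip diacritics (U+064B..U+0652) and tatweel (U+0640).
    $text = preg_replace('/[\x{064B}-\x{0652}\x{0640}]/u', '', $text);
    // Unify alef with hamza or madda into bare alef.
    return preg_replace('/[أإآ]/u', 'ا', $text);
}

// In-memory stand-in for a table with an extra normalized column.
$rows = [];
foreach (['أحمد', 'احمد', 'إبراهيم', 'سَلمى'] as $name) {
    $rows[] = ['name' => $name, 'name_norm' => normalizeForSearch($name)];
}

// Normalize the input, match on name_norm, return the original name.
function searchNames(array $rows, string $input): array
{
    $needle = normalizeForSearch($input);
    $found = [];
    foreach ($rows as $row) {
        if (strpos($row['name_norm'], $needle) !== false) {
            $found[] = $row['name'];
        }
    }
    return $found;
}

print_r(searchNames($rows, 'أحمد')); // finds both spellings, with and without hamza
```

In a real database the loop would become a query against the normalized column with the normalized search string bound as a parameter, while the select list returns the original column.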

Reference: Arabic Case Insensitive In Database Systems: How To Solve Alef With and Without Hamza Problem.

Example Output 1:

نتائج البحث عن فلسطينيون:
فلسطينيون فلسطيني فلسطينية فلسطينيتين فلسطينيين فلسطينيان فلسطينيات فلسطينيوا

صيغة استعلام قاعدة البيانات (SQL Query Statement)

SELECT `field` FROM `table` WHERE (field LIKE '%فلسطيني%' AND REGEXP '((ا|أ|إ|آ)ل)?فلسطيني(ون)?') ORDER BY ((CASE WHEN (field LIKE '%فلسطيني%' AND REGEXP '((ا|أ|إ|آ)ل)?فلسطيني(ون)?') THEN 1 ELSE 0 END)) DESC
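To see what the regular expression in this statement accepts, the same pattern can be tried in PHP. This is only a sketch: it uses PHP's PCRE engine with the `/u` flag rather than MySQL's REGEXP, whose behavior may differ in details, and the sample words are chosen for illustration.

```php
<?php
// The pattern from the generated SQL, wrapped in PCRE delimiters and
// flagged as UTF-8.
$pattern = '/((ا|أ|إ|آ)ل)?فلسطيني(ون)?/u';

$samples = ['فلسطينيون', 'الفلسطينيون', 'فلسطينية', 'فلسطين'];
foreach ($samples as $word) {
    $result = preg_match($pattern, $word) === 1 ? 'match' : 'no match';
    echo $word, ' => ', $result, "\n";
}
```

Because the pattern is unanchored and both its prefix and suffix groups are optional, it accepts any value that contains the stem فلسطيني; only the last sample, which lacks the final letter of the stem, is rejected.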

Example Code 1:

<?php
    $Arabic = new \ArPHP\I18N\Arabic();
        
    echo 'نتائج البحث عن <b>فلسطينيون</b>:<br />';
    echo $Arabic->arQueryAllForms('فلسطينيون');
    
    $keyword = 'فلسطينيون';
    // Undo backslash-escaped double quotes (\") that may be present in submitted input
    $keyword = str_replace('\"', '"', $keyword);

    $Arabic->setQueryStrFields('field');
    $Arabic->setQueryMode(0);  // 0 for any word, 1 for all words

    $strCondition = $Arabic->arQueryWhereCondition($keyword);
    $strOrderBy   = $Arabic->arQueryOrderBy($keyword);

    $StrSQL = "SELECT `field` FROM `table` WHERE $strCondition ORDER BY $strOrderBy";
    
    echo '<hr />صيغة استعلام قاعدة البيانات <span dir="ltr">(SQL Query Statement)</span><br />';
    echo '<pre dir="ltr" style="background-color: #e0e0e0; padding: 5px">' . $StrSQL . '</pre>';

Related Documentation: arQueryAllForms, setQueryStrFields, setQueryMode, arQueryWhereCondition, arQueryOrderBy

Example Output 2:

Original Text

آسِفـــةٌ لا تَنَبُّؤْ 456

Normalized Text

اسفه لا تنبء 456
اسفه لا تنبء ٤٥٦

Example Code 2:

<?php
    $Arabic = new \ArPHP\I18N\Arabic();
    
    $text = 'آسِفـــةٌ لا تَنَبُّؤْ 456';

    $Arabic->setNorm('stripTatweel', true)
           ->setNorm('stripTanween', true)
           ->setNorm('stripShadda', true)
           ->setNorm('stripLastHarakat', true)
           ->setNorm('stripWordHarakat', true)
           ->setNorm('normaliseLamAlef', true)
           ->setNorm('normaliseAlef', true)
           ->setNorm('normaliseHamza', true)
           ->setNorm('normaliseTaa', true);

    # Alternatively, enable every rule with 'all' and then switch individual rules off:
    # $Arabic->setNorm('all', true)->setNorm('normaliseHamza', false)->setNorm('normaliseTaa', false);

    echo '<b>Original Text</b>';
    echo '<p dir="rtl" align="justify">';
    echo $text . '</p>';

    echo '<hr /><b>Normalized Text</b>';
    echo '<p dir="rtl" align="justify">';
    echo $Arabic->arNormalizeText($text) . '<br/>';
    echo $Arabic->arNormalizeText($text, 'Hindu') . '</p>';

Related Documentation: setNorm, arNormalizeText