Towards enhancing retrieval effectiveness of search engines for diacritisized Arabic documents

  • Authors: Bassam H. Hammo
  • Affiliations: King Abdullah II School for Information Technology, University of Jordan, Amman, Jordan 11942
  • Venue: Information Retrieval
  • Year: 2009

Abstract

The majority of Arabic text available on the web is written without short vowels (diacritics). Diacritics are commonly used in religious scripts such as the Holy Quran (the book of Islam) and Al-Hadith (the teachings of Prophet Mohammad (PBUH)), in children's literature, and in some words where ambiguity of articulation might arise. Arabic-speaking Internet users might fail to retrieve credible sources of Arabic text if their queries do not match the diacritical marks attached to the words in the collection. However, typing diacritical marks is tedious and time consuming, while the alternative, ignoring these marks, leads to the problem of ambiguity. Previous work suggested pre-processing Arabic text to remove the diacritical marks before indexing. Consequently, there are noticeable discrepancies when searching the web for Arabic text using international search engines such as Google and Yahoo. In this article, we propose a framework that enhances the retrieval effectiveness of search engines for both diacritized and diacritic-less Arabic text through query expansion techniques. We used a rule-based stemmer and a semantic relational database compiled into an experimental thesaurus to perform the expansion. We tested our approach on the scripts of the Quran. We found that query expansion for searching Arabic text is promising, and it is likely that the efficiency can be further improved by advanced natural language processing tools.
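
To make the idea of matching diacritized and diacritic-less text concrete, the following is a minimal Python sketch, not the framework described in the article. It assumes that diacritics are the Unicode marks U+064B to U+0652 plus U+0670, and it expands a query term to every surface form in the collection that shares the same diacritic-free skeleton. The function names (`strip_diacritics`, `build_variant_map`, `expand_query`, `search`) are hypothetical, and the rule-based stemmer and thesaurus mentioned in the abstract are omitted.

```python
import re

# Assumption: treat fathatan..sukun (U+064B-U+0652) and the superscript
# alef (U+0670) as the diacritical marks to ignore when comparing words.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return the text with the assumed diacritical marks removed."""
    return DIACRITICS.sub("", text)


def build_variant_map(docs):
    """Map each diacritic-free word to the surface forms seen in docs."""
    variants = {}
    for text in docs:
        for token in text.split():
            variants.setdefault(strip_diacritics(token), set()).add(token)
    return variants


def expand_query(term, variants):
    """Expand a term to all collection forms sharing its bare skeleton."""
    bare = strip_diacritics(term)
    return {term, bare} | variants.get(bare, set())


def search(term, docs, variants):
    """Return indices of documents containing any expanded form."""
    wanted = expand_query(term, variants)
    return [i for i, text in enumerate(docs) if wanted & set(text.split())]


# Toy collection: the same three-letter skeleton (kaf, teh, beh) appears
# with fatha marks, with no marks, and with damma marks.
BARE = "\u0643\u062A\u0628"
WITH_FATHA = "\u0643\u064E\u062A\u064E\u0628\u064E"
WITH_DAMMA = "\u0643\u064F\u062A\u064F\u0628"
DOCS = [WITH_FATHA + " x", BARE, WITH_DAMMA, "y"]
VARIANTS = build_variant_map(DOCS)
```

In this toy collection, `search(BARE, DOCS, VARIANTS)` matches the first three documents. That illustrates the trade-off the abstract describes: an undiacritized query gains recall but cannot distinguish between the differently vowelled readings.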