A novel Arabic lemmatization algorithm

  • Authors:
  • Eiman Al-Shammari; Jessica Lin

  • Affiliations:
  • Kuwait University; George Mason University, Fairfax, VA

  • Venue:
  • Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data
  • Year:
  • 2008


Abstract

Tokenization is a fundamental step in processing textual data, preceding tasks such as information retrieval, text mining, and natural language processing. Tokenization is a language-dependent process that includes normalization, stop-word removal, lemmatization, and stemming. Stemming and lemmatization share the common goal of reducing a word to its base form. However, lemmatization is more robust than stemming, as it typically draws on vocabulary and morphological analysis rather than simply removing a word's suffix. In this work, we introduce a novel lemmatization algorithm for Arabic. The proposed lemmatizer is part of a comprehensive Arabic tokenization system with a stop-word list exceeding 2,200 Arabic words. Currently, there are two leading Arabic stemmers: the root-based stemmer and the light stemmer. We hypothesize that lemmatization is more effective than stemming for mining Arabic text, and we investigate the impact of our new lemmatizer on unsupervised data mining techniques in comparison to the leading Arabic stemmers. We conclude that lemmatization is a better word-normalization method than stemming for Arabic text.
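The abstract's contrast between the two approaches can be sketched in code. The paper does not reproduce its algorithm here, so the following is only an illustrative toy: a simplified light stemmer that strips one common affix from each end, versus a dictionary-backed lemmatizer that consults a vocabulary first. The affix lists and dictionary entries are hypothetical examples, not the system described in the paper.

```python
# Hypothetical affix lists for illustration only.
ARABIC_PREFIXES = ["ال", "و", "ف", "ب", "ك", "ل"]
ARABIC_SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "ة"]

def light_stem(word: str) -> str:
    """Light stemming (simplified): strip at most one common prefix
    and one common suffix, keeping a stem of at least 3 characters."""
    for p in ARABIC_PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in ARABIC_SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word

# Toy lemma dictionary (hypothetical entries).
LEMMA_DICT = {
    "مكتبات": "مكتبة",  # "libraries" -> "library"
    "كتب": "كتاب",      # "books" -> "book"
}

def lemmatize(word: str) -> str:
    """Lemmatization sketch: prefer a vocabulary lookup,
    falling back to light stemming for unknown words."""
    return LEMMA_DICT.get(word, light_stem(word))
```

The sketch shows the key difference the abstract describes: the stemmer operates purely on surface affixes, while the lemmatizer uses a vocabulary to map a word to its base form and only degrades to affix stripping when the word is out of vocabulary.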