A novel Arabic lemmatization algorithm

  • Authors:
  • Eiman Al-Shammari; Jessica Lin

  • Affiliations:
  • Kuwait University; George Mason University, Fairfax, VA

  • Venue:
  • Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data
  • Year:
  • 2008


Abstract

Tokenization is a fundamental step in processing textual data, preceding tasks such as information retrieval, text mining, and natural language processing. Tokenization is a language-dependent process that includes normalization, stop-word removal, lemmatization, and stemming. Stemming and lemmatization share the common goal of reducing a word to its base form. However, lemmatization is more robust than stemming, as it typically draws on vocabulary and morphological analysis rather than simply removing a word's suffix. In this work, we introduce a novel lemmatization algorithm for Arabic. The proposed lemmatizer is part of a comprehensive Arabic tokenization system with a stop-word list exceeding 2,200 Arabic words. Currently, there are two leading Arabic stemmers: the root-based stemmer and the light stemmer. We hypothesize that lemmatization is more effective than stemming for mining Arabic text, and we investigate the impact of our new lemmatizer on unsupervised data mining techniques in comparison to the leading Arabic stemmers. We conclude that lemmatization is a better word-normalization method than stemming for Arabic text.
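The abstract's contrast between the two approaches can be sketched in code. The paper does not reproduce its algorithm here, so the following is only an illustrative toy: a simplified light stemmer that strips one common affix from each end, versus a dictionary-backed lemmatizer that consults a vocabulary first. The affix lists and dictionary entries are hypothetical examples, not the system described in the paper.

```python
# Hypothetical affix lists for illustration only.
ARABIC_PREFIXES = ["ال", "و", "ف", "ب", "ك", "ل"]
ARABIC_SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "ة"]

def light_stem(word: str) -> str:
    """Light stemming (simplified): strip at most one common prefix
    and one common suffix, keeping a stem of at least 3 characters."""
    for p in ARABIC_PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in ARABIC_SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word

# Toy lemma dictionary (hypothetical entries).
LEMMA_DICT = {
    "مكتبات": "مكتبة",  # "libraries" -> "library"
    "كتب": "كتاب",      # "books" -> "book"
}

def lemmatize(word: str) -> str:
    """Lemmatization sketch: prefer a vocabulary lookup,
    falling back to light stemming for unknown words."""
    return LEMMA_DICT.get(word, light_stem(word))
```

The sketch shows the key difference the abstract describes: the stemmer operates purely on surface affixes, while the lemmatizer uses a vocabulary to map a word to its base form and only degrades to affix stripping when the word is out of vocabulary.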