Tokenization is a fundamental step in processing textual data that precedes information retrieval, text mining, and natural language processing tasks. It is a language-dependent process that typically includes normalization, stop-word removal, lemmatization, and stemming. Stemming and lemmatization share the goal of reducing a word to its base form; however, lemmatization is more robust than stemming because it typically draws on vocabulary and morphological analysis rather than simply stripping affixes. In this work, we introduce a novel lemmatization algorithm for the Arabic language. The proposed lemmatizer is part of a comprehensive Arabic tokenization system with a stop-word list exceeding 2,200 Arabic words. Currently, there are two leading Arabic stemmers: the root-based stemmer and the light stemmer. We hypothesize that lemmatization is more effective than stemming for mining Arabic text, and we evaluate the impact of our lemmatizer on unsupervised data mining techniques in comparison with the leading Arabic stemmers. We conclude that lemmatization is a better word normalization method than stemming for Arabic text.
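To make the pipeline stages concrete, the following is a minimal sketch of an Arabic tokenization pipeline of the kind described above: orthographic normalization, stop-word removal, and light-stemmer-style affix stripping. It is illustrative only; the diacritic range and letter mappings are standard Unicode Arabic, but the stop-word, prefix, and suffix lists are tiny hypothetical samples, not the paper's actual resources, and the paper's lemmatizer itself is not reproduced here.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch of an Arabic tokenization pipeline: normalization,
# stop-word removal, and light stemming. The affix and stop-word lists
# below are tiny hypothetical samples, not the paper's actual resources.
import re

DIACRITICS = re.compile(r"[\u064B-\u0652]")  # Arabic tashkeel (short-vowel marks)

def normalize(word: str) -> str:
    """Collapse common orthographic variants (illustrative subset)."""
    word = DIACRITICS.sub("", word)                        # strip diacritics
    word = re.sub("[\u0622\u0623\u0625]", "\u0627", word)  # alef variants -> bare alef
    word = word.replace("\u0649", "\u064A")                # alef maqsura -> ya
    return word

# Stop words are stored normalized, since lookup happens after normalize().
STOP_WORDS = {normalize(w) for w in ["في", "من", "على"]}  # "in", "from", "on"

PREFIXES = ["وال", "ال"]        # "wal-", "al-" (longest first)
SUFFIXES = ["ها", "ات", "ون"]   # "-ha", "-at", "-un"

def light_stem(word: str) -> str:
    """Light-stemmer-style stripping: at most one prefix and one suffix."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 2:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 2:
            word = word[:-len(s)]
            break
    return word

def tokenize(text: str) -> list[str]:
    """Full pipeline: split, normalize, drop stop words, stem."""
    tokens = (normalize(w) for w in text.split())
    return [light_stem(t) for t in tokens if t and t not in STOP_WORDS]
```

A lemmatizer would replace the `light_stem` step with a dictionary and morphological-analysis lookup that maps each surface form to its lemma, which is precisely where the robustness gap between the two approaches arises.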