Arabic texts analysis for topic modeling evaluation

Authors:
Abderrezak Brahmi;Ahmed Ech-Cherif;Abdelkader Benyettou
Affiliations:
Department of Computer Science, University of Abdelhamid Ibn Badis, Mostaganem, Algeria;Department of Computer Science, USTO-MB, Oran, Algeria;Department of Computer Science, USTO-MB, Oran, Algeria
Venue:
Information Retrieval
Year:
2012

Abstract

Significant progress has been made in information retrieval, covering semantic text indexing and multilingual analysis. However, developments in Arabic information retrieval have not kept pace with the extraordinary growth of Arabic usage on the Web over the last ten years. In tasks relating to semantic analysis, it is preferable to deal directly with texts in their original language. Studies on topic models, which provide a good way to automatically handle the semantics embedded in texts, are not complete enough to assess the effectiveness of the approach on Arabic texts. This paper investigates several text stemming methods for Arabic topic modeling. A new lemma-based stemmer is described and applied to newspaper articles. The Latent Dirichlet Allocation model is used to extract latent topics from three real-world Arabic corpora. For supervised classification in the topic space, experiments show an improvement compared with classification in the full word space or with a root-based stemming approach. In addition, topic modeling with lemma-based stemming allows us to discover interesting subjects in press articles published during the 2007-2009 period.
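
To make the topic-extraction step more concrete, the following is a minimal sketch of Latent Dirichlet Allocation fitted by collapsed Gibbs sampling on documents that have already been tokenised and stemmed. The abstract does not say which inference procedure or hyperparameters were used, so the sampler, the function name `lda_gibbs`, the values of `alpha`, `beta`, and `iters`, and the four-document toy corpus are all assumptions made for illustration, not the authors' setup.

```python
import random


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative; hyperparameters are assumed)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables maintained by the sampler.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Random initial topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                wid = word_id[w]
                # Remove the token's current assignment from the counts.
                z = assignments[d][n]
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Unnormalised conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                # Record the newly sampled assignment.
                assignments[d][n] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    # Smoothed document-topic (theta) and topic-word (phi) distributions.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + vocab_size * beta) for c in topic_word[k]]
        for k in range(num_topics)
    ]
    return theta, phi, vocab


# Toy corpus (hypothetical). The first two documents use the words for
# economy, bank, market; the last two use school, student, book.
docs = [
    ["اقتصاد", "بنك", "سوق", "بنك"],
    ["سوق", "اقتصاد", "بنك"],
    ["مدرسة", "طالب", "كتاب", "طالب"],
    ["كتاب", "مدرسة", "طالب"],
]

theta, phi, vocab = lda_gibbs(docs, num_topics=2)
for row in theta:
    print([round(p, 3) for p in row])
```

Each row of `theta` is a document's distribution over topics. Under the setup described in the abstract, vectors of this kind would serve as the low-dimensional features for the supervised classifier, in place of counts over the full vocabulary.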