A topic identification task for modern standard Arabic

Authors:
Mourad Abbas;Daoud Berkani
Affiliations:
Signal and Communication Laboratory, Polytechnic National School, Algiers, Algeria;Signal and Communication Laboratory, Polytechnic National School, Algiers, Algeria
Venue:
ICCOMP'06 Proceedings of the 10th WSEAS international conference on Computers
Year:
2006

Abstract

In this paper, we present two well-known categorization methods and their use in topic identification for Modern Standard Arabic. The first is the TFIDF approach, and the second is a Support Vector Machine (SVM) based classifier. To the best of our knowledge, there is little previous work on Arabic topic identification, which is the task we investigate in this article. The corpus we used was extracted from the Arabic daily newspaper 'Alkhabar'; it includes 6000 news articles, amounting to nearly 3 million words, and covers the following topics: local news, sport, international news and economy. In our experiments, the results are encouraging for both the SVM and the TFIDF classifiers, particularly when compared with those reported for French. However, the SVM classifier proved superior, showing a high capability to distinguish topics. In addition, we show the effect of vocabulary size on improving the results. Further experiments were conducted to determine the minimum number of words needed to achieve acceptable results. This is particularly interesting for speech recognition when language adaptation is needed.
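
The abstract does not give the exact TFIDF formulation used, so the following is only a minimal sketch of one common variant: each topic is represented by the sum of the TFIDF vectors of its training documents, and a new text is assigned to the topic whose vector has the highest cosine similarity. The function names (`compute_idf`, `vectorize`, `train`, `identify_topic`), the smoothed IDF formula and the toy English tokens are illustrative assumptions, not the authors' implementation or data.

```python
import math
from collections import Counter, defaultdict


def compute_idf(docs):
    """Inverse document frequency for every term seen in the training docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed smoothing: +1 keeps terms that occur in every document non-zero.
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}


def vectorize(tokens, idf):
    """Sparse TFIDF vector (dict); terms outside the vocabulary are ignored."""
    tf = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in tf.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labeled_docs):
    """Build one summed TFIDF vector per topic from (tokens, topic) pairs."""
    idf = compute_idf([tokens for tokens, _ in labeled_docs])
    topic_vectors = defaultdict(Counter)
    for tokens, topic in labeled_docs:
        for term, weight in vectorize(tokens, idf).items():
            topic_vectors[topic][term] += weight
    return idf, dict(topic_vectors)


def identify_topic(tokens, idf, topic_vectors):
    """Return the topic whose vector is most similar to the input text."""
    vec = vectorize(tokens, idf)
    return max(topic_vectors, key=lambda topic: cosine(vec, topic_vectors[topic]))


# Toy placeholder data (not from the Alkhabar corpus).
train_docs = [
    ("team wins match goal".split(), "sport"),
    ("player scores goal in match".split(), "sport"),
    ("bank raises interest rates".split(), "economy"),
    ("market prices and bank rates rise".split(), "economy"),
]
idf, topic_vectors = train(train_docs)
print(identify_topic("goal match tonight".split(), idf, topic_vectors))
```

In this sketch, shrinking the vocabulary amounts to dropping entries from `idf`, and truncating the token list passed to `identify_topic` shortens the text being classified, which is one way to explore the vocabulary-size and minimum-word-count questions mentioned in the abstract.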