A topic identification task for modern standard Arabic

  • Authors:
  • Mourad Abbas;Daoud Berkani

  • Affiliations:
  • Signal and Communication Laboratory, Polytechnic National School, Algiers, Algeria;Signal and Communication Laboratory, Polytechnic National School, Algiers, Algeria

  • Venue:
  • ICCOMP'06 Proceedings of the 10th WSEAS international conference on Computers
  • Year:
  • 2006

Abstract

In this paper we present two well-known categorization methods and their use in topic identification for Modern Standard Arabic. The first is the TFIDF approach, and the second is a classifier based on Support Vector Machines (SVM). To the best of our knowledge, few previous works address Arabic topic identification, which is the task we investigate in this article. The corpus we used is extracted from the daily Arabic newspaper 'Alkhabar'. It includes 6000 news articles, amounting to nearly 3 million words, and covers four topics: local news, sport, international news and economy. In our experiments, the results are encouraging for both the SVM and the TFIDF classifier, particularly when compared with those reported for French. However, we observed that the SVM classifier is superior and has a high capability to distinguish topics. In addition, we show the effect of vocabulary size on improving the results. Further experiments were conducted to determine the minimum number of words needed to achieve acceptable results. This is particularly interesting for speech recognition when language adaptation is needed.
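As an illustration of the TFIDF approach named in the abstract, the following is a minimal sketch, not the authors' implementation. The abstract does not give the weighting formula or the decision rule, so several things here are assumptions: each topic is represented by one TF-IDF vector built from all of its training articles, the IDF is smoothed, and a new text is assigned to the topic with the highest cosine similarity. The function names (`tokenize`, `train_tfidf`, `identify_topic`) and the toy English sentences are hypothetical placeholders; real Arabic text would also need proper tokenization and normalization. The SVM classifier is not sketched here.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain whitespace tokenization, for illustration only.
    return text.lower().split()


def train_tfidf(docs_by_topic):
    """Build one TF-IDF vector per topic from a dict: topic -> list of texts."""
    # Term frequencies, pooling all training articles of a topic together.
    topic_counts = {
        topic: Counter(w for doc in docs for w in tokenize(doc))
        for topic, docs in docs_by_topic.items()
    }
    n_topics = len(topic_counts)
    # Document frequency: in how many topics does each word occur?
    df = Counter(w for counts in topic_counts.values() for w in counts)
    # Assumption: smoothed IDF, so words shared by all topics keep weight 1.
    idf = {w: math.log((1 + n_topics) / (1 + df[w])) + 1.0 for w in df}
    topic_vectors = {
        topic: {w: c * idf[w] for w, c in counts.items()}
        for topic, counts in topic_counts.items()
    }
    return idf, topic_vectors


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def identify_topic(text, idf, topic_vectors):
    # Words outside the training vocabulary are ignored.
    counts = Counter(w for w in tokenize(text) if w in idf)
    doc_vector = {w: c * idf[w] for w, c in counts.items()}
    return max(topic_vectors, key=lambda t: cosine(doc_vector, topic_vectors[t]))


# Toy training data (hypothetical, in English for readability).
training = {
    "sport": ["the team won the match", "the player scored a goal"],
    "economy": ["the bank raised interest rates", "the market and prices fell"],
}
idf, topic_vectors = train_tfidf(training)
print(identify_topic("the player won the match", idf, topic_vectors))
```

In this sketch the vocabulary is simply every word seen in training. Restricting `idf` to the most frequent or most discriminative words would be one way to experiment with the vocabulary-size effect the abstract mentions.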