Feature reduction techniques for Arabic text categorization

  • Authors:
  • Rehab Duwairi;Mohammad Nayef Al-Refai;Natheer Khasawneh

  • Affiliations:
  • Department of Computer Information Systems, Jordan University of Science and Technology, Irbid, Jordan;Department of Computer Science, Jordan University of Science and Technology, Irbid, Jordan;Department of Computer Engineering, Jordan University of Science and Technology, Irbid, Jordan

  • Venue:
  • Journal of the American Society for Information Science and Technology
  • Year:
  • 2009

Abstract

This paper presents and compares three feature reduction techniques applied to Arabic text: stemming, light stemming, and word clusters. The effects of these techniques were studied with the K-nearest-neighbor classifier. Stemming reduces words to their stems. Light stemming, by comparison, removes common affixes from words without reducing them to their stems. Word clustering groups synonymous words into clusters, and each cluster is represented by a single word. The purpose of these methods is to reduce the size of document vectors without degrading classifier accuracy. The comparison metrics are document vector size, classification time, and accuracy (in terms of precision and recall). Several experiments were carried out using four representations of the same corpus: stem vectors, light-stem vectors, word clusters, and the original words (without any transformation). The corpus consists of 15,000 documents that fall into three categories: sports, economics, and politics. In terms of vector size and classification time, the stemmed vectors were the smallest and required the least time to classify a testing dataset of 6,000 documents. The light-stemmed vectors outperformed the other three representations in classification accuracy. © 2009 Wiley Periodicals, Inc.
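
To make the idea of feature reduction concrete, the following is a minimal Python sketch, not the authors' implementation. It shows how light stemming (affix removal) can collapse several surface forms into one term and so shrink a document vector, and how a K-nearest-neighbor vote over cosine similarity can then use the reduced vectors. The affix lists, the `min_len` threshold, and the names `light_stem`, `vectorize`, `cosine`, and `knn_classify` are all assumptions made for illustration; the abstract does not specify them.

```python
import math
from collections import Counter

# Hypothetical affix lists for illustration only; not the lists used in the paper.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")
SUFFIXES = ("ات", "ون", "ين", "ها", "ية", "ة", "ي")


def light_stem(word, min_len=3):
    """Strip at most one listed prefix and one listed suffix,
    never shortening the word below min_len characters."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= min_len:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_len:
            word = word[:-len(s)]
            break
    return word


def vectorize(tokens, transform=lambda w: w):
    """Build a term-frequency vector after applying a reduction function."""
    return Counter(transform(t) for t in tokens)


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_classify(query, training, k=3):
    """Majority vote among the k training vectors most similar to query.
    training is a list of (vector, label) pairs."""
    ranked = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Three surface forms of the same noun: the raw vector has three
# dimensions, while the light-stemmed vector has one.
tokens = ["الكتاب", "كتاب", "والكتاب"]
raw_size = len(vectorize(tokens))
stemmed_size = len(vectorize(tokens, light_stem))
```

A word-cluster representation could reuse `vectorize` by passing a function that maps each word to its cluster representative (for example, a dictionary lookup with the word itself as the default); the synonym table itself would have to be supplied separately.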