Best terms: an efficient feature-selection algorithm for text categorization

  • Authors:
  • Dimitris Fragoudis; Dimitris Meretakis; Spiridon Likothanassis

  • Affiliations:
  • Computer Engineering and Informatics Department, University of Patras, GR-26500, Rio-Patras, Greece and Computer Technology Institute, Patras, Greece; Novartis Pharma, Griffith University, GR-26500, Basel, Switzerland; Computer Engineering and Informatics Department, University of Patras, GR-26500, Rio-Patras, Greece and Computer Technology Institute, Patras, Greece

  • Venue:
  • Knowledge and Information Systems
  • Year:
  • 2005

Abstract

In this paper, we propose a new feature-selection algorithm for text classification, called best terms (BT). The complexity of BT is linear with respect to the number of training-set documents and is independent of both the vocabulary size and the number of categories. We evaluate BT on two benchmark document collections, Reuters-21578 and 20-Newsgroups, using two classification algorithms, naive Bayes (NB) and support vector machines (SVM). Our experimental results, comparing BT with an extensive and representative list of feature-selection algorithms, show that (1) BT is faster than the existing feature-selection algorithms; (2) BT leads to a considerable increase in the classification accuracy of NB and SVM, as measured by the F1 measure; (3) BT leads to a considerable improvement in the speed of NB and SVM, with the training time of SVM dropping by an order of magnitude in most cases; (4) in most cases, the combination of BT with the simple but very fast NB algorithm achieves classification accuracy comparable with that of SVM, and is sometimes even more accurate.