Feature selection in SVM text categorization

  • Authors:
  • Hirotoshi Taira; Masahiko Haruno

  • Venue:
  • AAAI '99/IAAI '99: Proceedings of the Sixteenth National Conference on Artificial Intelligence and the Eleventh Innovative Applications of Artificial Intelligence Conference
  • Year:
  • 1999

Abstract

This paper investigates the effect of prior feature selection in Support Vector Machine (SVM) text categorization. The input space was gradually enlarged using mutual information (MI) filtering and part-of-speech (POS) filtering, which select the subset of words appropriate for learning from the information-theoretic and the linguistic perspectives, respectively. We tested the two filtering methods with SVMs as well as with the decision tree algorithm C4.5. For SVMs, both filtering methods shared two results: 1) the optimal number of features differed completely across categories, and 2) the average performance over all categories was best when all of the words were used. In addition, a comparison of the two filtering methods showed that POS filtering on SVMs consistently outperformed MI filtering, which indicates that SVMs cannot find irrelevant parts of speech. These results suggest a simple strategy for SVM text categorization: use the full set of words obtained through a rough filtering technique such as part-of-speech tagging.
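
The MI-filtering experiment described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' setup: it uses the 20 Newsgroups corpus as a stand-in for their data, scikit-learn's mutual_info_classif as the MI ranking, a linear SVM, and arbitrary feature cutoffs.

```python
# Sketch of MI filtering followed by SVM training at increasing feature counts.
# Corpus, cutoff values, and model parameters are illustrative assumptions,
# not those used in the paper.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Two categories as a stand-in corpus (the paper's data set differs).
train = fetch_20newsgroups(subset="train", categories=["sci.med", "rec.autos"])
X = CountVectorizer(binary=True).fit_transform(train.data)
y = train.target

# Gradually enlarge the input space: small k approximates aggressive
# MI filtering; "all" corresponds to using every word.
for k in (100, 1000, 10000, "all"):
    X_k = SelectKBest(mutual_info_classif, k=k).fit_transform(X, y)
    score = cross_val_score(LinearSVC(), X_k, y, cv=3).mean()
    print(f"top-{k} features: accuracy = {score:.3f}")
```

POS filtering would instead apply a part-of-speech tagger before vectorization and keep only selected word classes (for example, nouns and verbs), rather than ranking individual words by mutual information.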