Text categorization with class-based and corpus-based keyword selection

  • Authors:
  • Arzucan Özgür;Levent Özgür;Tunga Güngör

  • Affiliations:
  • Department of Computer Engineering, Boğaziçi University, Bebek, İstanbul, Turkey;Department of Computer Engineering, Boğaziçi University, Bebek, İstanbul, Turkey;Department of Computer Engineering, Boğaziçi University, Bebek, İstanbul, Turkey

  • Venue:
  • ISCIS'05 Proceedings of the 20th international conference on Computer and Information Sciences
  • Year:
  • 2005

Abstract

In this paper, we examine the use of keywords in text categorization with SVM. Contrary to the common belief, we show that using keywords instead of all words yields better performance in terms of both accuracy and time. Unlike previous studies, which focus on keyword selection metrics, we compare two approaches to keyword selection. In the corpus-based approach, a single set of keywords is selected for all classes. In the class-based approach, a distinct set of keywords is selected for each class. We perform the experiments on the standard Reuters-21578 dataset, with both boolean and tf-idf weighting. Our results show that although tf-idf weighting performs better, boolean weighting can be used where time and space resources are limited. The corpus-based approach with 2000 keywords performs best. However, for a small number of keywords, the class-based approach outperforms the corpus-based approach with the same number of keywords.
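
The difference between the two selection approaches can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the scoring metric, so plain document frequency is used here as a stand-in, and the function names, the toy documents, and the labels are all hypothetical. The sketch only shows that corpus-based selection ranks terms once over all documents, while class-based selection ranks them separately within each class.

```python
from collections import Counter


def doc_freq(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def top_k(scores, k):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def corpus_based_keywords(docs, k):
    """One keyword set shared by all classes (assumed score: document frequency)."""
    return top_k(doc_freq(docs), k)


def class_based_keywords(docs, labels, k):
    """A separate keyword set per class, scored only on that class's documents."""
    by_class = {}
    for tokens, label in zip(docs, labels):
        by_class.setdefault(label, []).append(tokens)
    return {label: top_k(doc_freq(class_docs), k)
            for label, class_docs in by_class.items()}


def boolean_vector(tokens, keywords):
    """Boolean weighting: 1 if the keyword occurs in the document, else 0."""
    present = set(tokens)
    return [1 if kw in present else 0 for kw in keywords]


# Hypothetical toy corpus, not taken from Reuters-21578.
docs = [
    ["wheat", "price", "export"],
    ["wheat", "harvest", "price"],
    ["oil", "price", "barrel"],
    ["oil", "barrel", "opec"],
]
labels = ["grain", "grain", "crude", "crude"]

corpus_kw = corpus_based_keywords(docs, 2)
class_kw = class_based_keywords(docs, labels, 2)
first_doc_vector = boolean_vector(docs[0], corpus_kw)
```

With only two keywords, the corpus-wide ranking in this toy example keeps a term ("price") that occurs in both classes, whereas each per-class ranking can retain terms specific to that class. This is one plausible intuition for why the class-based approach may do better when the keyword budget is small, although the abstract itself does not state the cause.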