Classification of skewed and homogenous document corpora with class-based and corpus-based keywords

  • Authors:
  • Arzucan Özgür; Tunga Güngör

  • Affiliations:
  • Boğaziçi University, Computer Engineering Department, Bebek, Istanbul, Turkey; Boğaziçi University, Computer Engineering Department, Bebek, Istanbul, Turkey

  • Venue:
  • KI'06 Proceedings of the 29th annual German conference on Artificial intelligence
  • Year:
  • 2006

Abstract

In this paper, we examine the performance of two policies for keyword selection over standard document corpora of varying properties. In the corpus-based policy, a single set of keywords is selected globally for all classes, whereas in the class-based policy a distinct set of keywords is selected locally for each class. We use SVM as the learning method and perform experiments with boolean and tf-idf weighting. Contrary to common belief, we show that using keywords instead of all words generally yields better performance and that tf-idf weighting does not always outperform boolean weighting. Our results reveal that the corpus-based approach performs better for a large number of keywords, while the class-based approach performs better for a small number of keywords. In skewed datasets, class-based keyword selection performs consistently better than the corpus-based approach in terms of macro-averaged F-measure. In homogenous datasets, the performances of the class-based and corpus-based approaches are similar except for a small number of keywords.
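
The difference between the two selection policies can be illustrated with a minimal Python sketch. It is not the authors' implementation: the scoring function (the number of documents in a class that contain a term), the function names, and the toy corpus are all assumptions made for illustration, since the abstract does not state how terms are scored. The sketch only shows how a single global keyword set differs from per-class keyword sets when one class dominates the corpus.

```python
from collections import Counter


def class_term_scores(docs, labels):
    """Hypothetical scoring (an assumption, not taken from the paper):
    for each class, count the documents of that class containing each term."""
    scores = {}
    for tokens, label in zip(docs, labels):
        scores.setdefault(label, Counter()).update(set(tokens))
    return scores


def top_k(counter, k):
    """Top-k terms by score; ties are broken alphabetically for determinism."""
    ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    return {term for term, _ in ranked[:k]}


def corpus_based_keywords(scores, k):
    """Corpus-based policy: one global keyword set shared by all classes,
    ranked here by each term's score summed over the classes."""
    total = Counter()
    for per_class in scores.values():
        total.update(per_class)
    return top_k(total, k)


def class_based_keywords(scores, k):
    """Class-based policy: a separate keyword set for each class,
    ranked by the term's score within that class only."""
    return {label: top_k(per_class, k) for label, per_class in scores.items()}


# Toy skewed corpus (made up): four "sports" documents, one "law" document.
docs = [
    ["match", "goal", "team"],
    ["goal", "team", "win"],
    ["team", "match", "win"],
    ["goal", "match", "team"],
    ["court", "judge", "team"],
]
labels = ["sports", "sports", "sports", "sports", "law"]

scores = class_term_scores(docs, labels)
global_keywords = corpus_based_keywords(scores, k=2)
local_keywords = class_based_keywords(scores, k=2)

print(sorted(global_keywords))
print({label: sorted(terms) for label, terms in local_keywords.items()})
```

On this toy corpus with k = 2, the global set contains only terms that are frequent in the majority class, while the per-class sets retain "court" and "judge" for the minority class. This is one plausible intuition for why a class-based policy might help macro-averaged F-measure on skewed data. The abstract itself reports only the empirical result.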