Topic difference factor extraction between two document sets and its application to text categorization

  • Authors:
  • Takahiko Kawatani

  • Affiliations:
  • Hewlett-Packard Labs Japan, Tokyo, Japan

  • Venue:
  • SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 2002

Quantified Score

Hi-index 0.00

Visualization

Abstract

To improve performance in text categorization, it is important to extract distinctive features for each class. This paper proposes topic difference factor analysis (TDFA) as a method to extract projection axes that reflect topic differences between two document sets. Suppose all sentence vectors that compose each document are projected onto projection axes. TDFA obtains the axes that maximize the ratio between the document sets as to the sum of squared projections by solving a generalized eigenvalue problem. The axes are called topic difference factors (TDF's). By applying TDFA to the document set that belongs to a given class and a set of documents that is misclassified as belonging to that class by an existent classifier, we can obtain features that take large values in the given class but small ones in other classes, as well as features that take large values in other classes but small ones in the given class. A classifier was constructed applying the above features to complement the kNN classifier. As the results, the micro averaged F1 measure for Reuters-21578 improved from 83.69 to 87.27%.