Text categorization based on subtopic clusters

  • Authors:
  • Francis C. Y. Chik; Robert W. P. Luk; Korris F. L. Chung

  • Affiliations:
  • Department of Computing, Hong Kong Polytechnic University;Department of Computing, Hong Kong Polytechnic University;Department of Computing, Hong Kong Polytechnic University

  • Venue:
  • NLDB'05 Proceedings of the 10th international conference on Natural Language Processing and Information Systems
  • Year:
  • 2005

Abstract

The distribution of the number of documents across topic classes is typically highly skewed. As a result, classifiers tend to achieve good micro-average performance but weaker macro-average performance. Viewing topics as clusters in a high-dimensional space, we propose using clustering to determine subtopic clusters for large topic classes, on the assumption that a large topic cluster is in general a mixture of several subtopic clusters. We used Reuters news articles and support vector machines to evaluate whether using subtopic clusters can lead to better macro-average performance.
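
The gap between micro- and macro-averaged scores under a skewed class distribution can be illustrated with a small sketch. It is not taken from the paper: the topic names and the per-class counts below are hypothetical, and F1 is assumed as the measure being averaged. Micro-averaging pools the counts over all classes, so the large class dominates; macro-averaging takes the mean of the per-class scores, so each small class weighs as much as the large one.

```python
from collections import namedtuple

# Per-class binary decision counts: true positives, false positives, false negatives.
Counts = namedtuple("Counts", ["tp", "fp", "fn"])


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(per_class):
    """Pool counts over all classes, then compute a single F1."""
    tp = sum(c.tp for c in per_class.values())
    fp = sum(c.fp for c in per_class.values())
    fn = sum(c.fn for c in per_class.values())
    return f1(tp, fp, fn)


def macro_f1(per_class):
    """Compute F1 per class, then take the unweighted mean."""
    scores = [f1(c.tp, c.fp, c.fn) for c in per_class.values()]
    return sum(scores) / len(scores)


# Hypothetical counts: one large, well-classified topic and two small, poorly classified ones.
counts = {
    "large_topic": Counts(tp=950, fp=50, fn=50),
    "small_topic_a": Counts(tp=2, fp=3, fn=8),
    "small_topic_b": Counts(tp=1, fp=4, fn=9),
}

micro = micro_f1(counts)
macro = macro_f1(counts)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

With these made-up counts the micro-average stays close to the large topic's score while the macro-average is pulled down by the two small topics, which is the imbalance the abstract describes.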