Improving text classification accuracy using topic modeling over an additional corpus

  • Authors: Somnath Banerjee
  • Affiliations: Hewlett-Packard Labs India, Bangalore, India
  • Venue: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
  • Year: 2008

Abstract

The World Wide Web has many document repositories that can act as valuable sources of additional data for various machine learning tasks. In this paper, we propose a method of improving text classification accuracy by using such an additional corpus that can easily be obtained from the web. This additional corpus can be unlabeled and independent of the given classification task. The method proposed here uses topic modeling to extract a set of topics from the additional corpus. Those extracted topics then act as additional features of the data of the given classification task. An evaluation on the RCV1 dataset shows significant improvement over a baseline method.
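The pipeline described in the abstract can be illustrated with a small sketch: learn topics from an unlabeled additional corpus, then append each task document's topic proportions to its ordinary bag-of-words features. The abstract does not name a topic model, so this sketch assumes latent Dirichlet allocation fitted by collapsed Gibbs sampling. The function names (`fit_topics`, `topic_features`, `augment`), the hyperparameters, and the toy corpus are all hypothetical, and the per-document topic inference is a deliberately crude approximation rather than the paper's procedure.

```python
import random
from collections import Counter


def fit_topics(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA on tokenized docs by collapsed Gibbs sampling (assumed model).

    Returns (word_id, phi), where phi[k][w] is a smoothed p(word w | topic k).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    corpus = [[word_id[w] for w in doc] for doc in docs]

    doc_topic = [[0] * n_topics for _ in corpus]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(corpus):
        assign.append([rng.randrange(n_topics) for _ in doc])
        for w, k in zip(doc, assign[d]):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(corpus):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the collapsed conditional.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    phi = [
        [(topic_word[k][w] + beta) / (topic_total[k] + n_words * beta)
         for w in range(n_words)]
        for k in range(n_topics)
    ]
    return word_id, phi


def topic_features(doc, word_id, phi):
    """Approximate topic proportions for a document outside the corpus.

    Simplification (an assumption of this sketch): average the per-word
    topic posteriors under a uniform topic prior; unknown words are skipped.
    """
    n_topics = len(phi)
    known = [word_id[w] for w in doc if w in word_id]
    if not known:
        return [1.0 / n_topics] * n_topics
    feats = [0.0] * n_topics
    for w in known:
        column = [phi[k][w] for k in range(n_topics)]
        total = sum(column)
        for k in range(n_topics):
            feats[k] += column[k] / total / len(known)
    return feats


def augment(doc, task_vocab, word_id, phi):
    """Concatenate bag-of-words counts with the topic features."""
    counts = Counter(doc)
    bow = [float(counts[w]) for w in task_vocab]
    return bow + topic_features(doc, word_id, phi)


# Toy unlabeled additional corpus, independent of the classification task.
additional_corpus = [
    "market shares stocks trading market".split(),
    "stocks shares investors market".split(),
    "goal match team player goal".split(),
    "team player match season".split(),
]
word_id, phi = fit_topics(additional_corpus, n_topics=2, n_iter=50)
features = augment("market shares fell".split(), ["market", "fell"], word_id, phi)
```

Here `features` holds two word counts followed by two topic proportions, and any standard classifier could be trained on such vectors. The additional corpus needs no labels because the topics are learned without supervision; only the task documents carry class labels.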