Improving text classification accuracy using topic modeling over an additional corpus

Authors: Somnath Banerjee
Affiliations: Hewlett-Packard Labs India, Bangalore, India
Venue: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Year: 2008

Abstract

The World Wide Web hosts many document repositories that can serve as valuable sources of additional data for machine learning tasks. In this paper, we propose a method for improving text classification accuracy by using such an additional corpus, which can easily be obtained from the web. This corpus can be unlabeled and independent of the given classification task. The proposed method uses topic modeling to extract a set of topics from the additional corpus. The extracted topics then serve as additional features for the data of the given classification task. An evaluation on the RCV1 dataset shows a significant improvement over a baseline method.
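
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes LDA as the topic model, fitted by a small collapsed Gibbs sampler, and it appends inferred topic proportions to a plain bag-of-words vector. The function names (`fit_lda`, `topic_features`, `augment`), the hyperparameters, and the toy corpora are all hypothetical and chosen only for illustration.

```python
import random

def fit_lda(docs, num_topics, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents.
    Returns (vocab, phi), where phi[k][v] is the smoothed P(word v | topic k)."""
    rng = random.Random(seed)
    vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}
    V, K = len(vocab), num_topics
    topic_word = [[0] * V for _ in range(K)]
    topic_total = [0] * K
    doc_topic = [[0] * K for _ in docs]
    assign = []
    for d, doc in enumerate(docs):  # random initial topic for every token
        zs = [rng.randrange(K) for _ in doc]
        for w, z in zip(doc, zs):
            topic_word[z][vocab[w]] += 1
            topic_total[z] += 1
            doc_topic[d][z] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                v, z = vocab[w], assign[d][n]
                # Remove the token's current assignment, then resample it.
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                doc_topic[d][z] -= 1
                weights = [(doc_topic[d][k] + alpha) * (topic_word[k][v] + beta)
                           / (topic_total[k] + V * beta) for k in range(K)]
                z = rng.choices(range(K), weights=weights)[0]
                assign[d][n] = z
                topic_word[z][v] += 1
                topic_total[z] += 1
                doc_topic[d][z] += 1
    phi = [[(topic_word[k][v] + beta) / (topic_total[k] + V * beta)
            for v in range(V)] for k in range(K)]
    return vocab, phi

def topic_features(doc, vocab, phi, alpha=0.1, iters=50):
    """Estimate topic proportions for a new document with the topics held fixed."""
    K = len(phi)
    ids = [vocab[w] for w in doc if w in vocab]  # words unseen by the topic model are skipped
    theta = [1.0 / K] * K
    for _ in range(iters):
        counts = [alpha] * K
        for v in ids:
            q = [theta[k] * phi[k][v] for k in range(K)]
            s = sum(q)
            for k in range(K):
                counts[k] += q[k] / s
        total = sum(counts)
        theta = [c / total for c in counts]
    return theta

def augment(doc, task_vocab, lda_vocab, phi):
    """Bag-of-words counts over the task vocabulary, followed by topic proportions."""
    bow = [0.0] * len(task_vocab)
    for w in doc:
        if w in task_vocab:
            bow[task_vocab[w]] += 1.0
    return bow + topic_features(doc, lda_vocab, phi)

# Toy, unlabeled additional corpus (hypothetical stand-in for web documents).
extra_corpus = [s.split() for s in [
    "stocks market shares trading profit",
    "market shares investors profit bank",
    "match goal team league player",
    "team player coach league season",
]]
# Toy labeled task data (hypothetical).
task_docs = [("bank profit rises", "finance"), ("coach praises team", "sport")]
task_vocab = {w: i for i, w in enumerate(
    sorted({w for text, _ in task_docs for w in text.split()}))}

lda_vocab, phi = fit_lda(extra_corpus, num_topics=2)
features = [augment(text.split(), task_vocab, lda_vocab, phi) for text, _ in task_docs]
```

Each row of `features` holds the document's word counts followed by two topic proportions. These augmented vectors, paired with the task labels, could then be passed to any standard classifier. The choice of classifier is left open here.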