Document classification by topic labeling

  • Authors:
  • Swapnil Hingmire; Sandeep Chougule; Girish K. Palshikar; Sutanu Chakraborti

  • Affiliations:
  • Tata Consultancy Services, Pune, India; Tata Consultancy Services, Pune, India; Tata Consultancy Services, Pune, India; IIT Madras, Chennai, India

  • Venue:
  • Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 2013

Abstract

In this paper, we propose a document classification algorithm based on Latent Dirichlet Allocation (LDA) [1] that does not require any labeled dataset. In our algorithm, we construct a topic model using LDA, assign each topic to one of the class labels, aggregate all topics that share a class label into a single topic using the aggregation property of the Dirichlet distribution, and then automatically assign a class label to each unlabeled document depending on its "closeness" to one of the aggregated topics. We present an extension to our algorithm based on a combination of the Expectation-Maximization (EM) algorithm and a naive Bayes classifier. We show the effectiveness of our algorithm on three real-world datasets.
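
The aggregation and labeling steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that an LDA model has already produced per-document topic proportions, that each topic has been given a class label by hand, and that "closeness" can be approximated by the largest aggregated topic proportion (the abstract does not define the measure). The function names, labels, and numbers below are hypothetical.

```python
from collections import defaultdict


def aggregate_by_label(doc_topic, topic_to_label):
    """Sum a document's topic proportions over topics sharing a class label.

    Summing components of a Dirichlet-distributed vector gives another
    Dirichlet-distributed vector, which is what makes this merge valid.
    """
    totals = defaultdict(float)
    for topic_id, weight in enumerate(doc_topic):
        totals[topic_to_label[topic_id]] += weight
    return dict(totals)


def classify(doc_topic, topic_to_label):
    """Assumed closeness rule: pick the label with the most aggregated mass."""
    totals = aggregate_by_label(doc_topic, topic_to_label)
    return max(totals, key=totals.get)


# Hypothetical labeling: four LDA topics mapped by hand to two classes.
topic_to_label = ["sports", "politics", "sports", "politics"]

# Hypothetical topic proportions for two unlabeled documents,
# standing in for what a fitted LDA model might infer.
doc_topics = [
    [0.50, 0.10, 0.30, 0.10],
    [0.05, 0.60, 0.05, 0.30],
]

predicted = [classify(theta, topic_to_label) for theta in doc_topics]
print(predicted)
```

In this sketch the only manual effort is labeling topics, not documents, which is the sense in which no labeled dataset is needed. The EM and naive Bayes extension mentioned in the abstract is not covered here.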