Hierarchical Dirichlet model for document classification

  • Authors:
  • Sriharsha Veeramachaneni; Diego Sona; Paolo Avesani

  • Affiliations:
  • SRA Division, Istituto per la ricerca scientifica e tecnologica (ITC-IRST), Trento, Italy (all three authors)

  • Venue:
  • ICML '05 Proceedings of the 22nd international conference on Machine learning
  • Year:
  • 2005

Abstract

The proliferation of text documents on the web and within institutions calls for convenient organization that enables efficient information retrieval. Although text corpora are frequently organized into concept hierarchies or taxonomies, classifying documents into the hierarchy is expensive in terms of human effort. We present a novel and simple hierarchical Dirichlet generative model for text corpora and derive an efficient algorithm for estimating the model parameters and for the unsupervised classification of text documents into a given hierarchy. The class-conditional feature means are assumed to be inter-related through the hierarchical Bayesian structure of the model. We show that the algorithm provides robust estimates of the classification parameters by performing smoothing, or regularization. We present experimental evidence on real web data that our algorithm achieves significant gains in accuracy over simpler models.
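
The smoothing effect mentioned in the abstract can be illustrated with a small sketch. This is not the estimation algorithm derived in the paper, which the abstract does not spell out; it only shows the standard Dirichlet-multinomial posterior mean applied top-down over a class hierarchy, so that each class's word distribution is shrunk toward its parent's. The function names, the `concentration` parameter, the uniform prior at the root, and the toy counts are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


def dirichlet_posterior_mean(counts: List[float],
                             prior_mean: List[float],
                             concentration: float) -> List[float]:
    """Posterior mean of multinomial parameters under a Dirichlet prior
    whose mean is `prior_mean` and whose total pseudo-count is `concentration`."""
    total = sum(counts) + concentration
    return [(c + concentration * m) / total for c, m in zip(counts, prior_mean)]


def smooth_hierarchy(counts: Dict[str, List[float]],
                     parent: Dict[str, Optional[str]],
                     concentration: float) -> Dict[str, List[float]]:
    """Estimate a word distribution per class, shrinking each class toward
    the estimate of its parent (hypothetical helper, root uses a uniform prior)."""
    vocab_size = len(next(iter(counts.values())))
    uniform = [1.0 / vocab_size] * vocab_size
    estimates: Dict[str, List[float]] = {}

    def estimate(node: str) -> List[float]:
        # Compute the parent's estimate first, then use it as the prior mean.
        if node not in estimates:
            p = parent[node]
            prior = uniform if p is None else estimate(p)
            estimates[node] = dirichlet_posterior_mean(
                counts[node], prior, concentration)
        return estimates[node]

    for node in counts:
        estimate(node)
    return estimates


# Toy word counts over a three-word vocabulary (made-up numbers).
toy_counts = {"root": [6.0, 2.0, 2.0],
              "sports": [3.0, 0.0, 0.0],
              "politics": [0.0, 1.0, 0.0]}
toy_parent = {"root": None, "sports": "root", "politics": "root"}

smoothed = smooth_hierarchy(toy_counts, toy_parent, concentration=2.0)
for name, dist in smoothed.items():
    print(name, [round(x, 3) for x in dist])
```

In this sketch, words never observed in a sparsely populated class still receive non-zero probability inherited from the parent, which is the sense in which a hierarchical prior regularizes the class-conditional estimates. A larger `concentration` pulls children more strongly toward their parents.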