Clustering documents with an exponential-family approximation of the Dirichlet compound multinomial distribution

  • Authors:
  • Charles Elkan

  • Affiliations:
  • Massachusetts Institute of Technology, Cambridge MA and University of California, San Diego, La Jolla, CA

  • Venue:
  • ICML '06 Proceedings of the 23rd international conference on Machine learning
  • Year:
  • 2006

Quantified Score

Hi-index 0.00

Visualization

Abstract

The Dirichlet compound multinomial (DCM) distribution, also called the multivariate Polya distribution, is a model for text documents that takes into account burstiness: the fact that if a word occurs once in a document, it is likely to occur repeatedly. We derive a new family of distributions that are approximations to DCM distributions and constitute an exponential family, unlike DCM distributions. We use these so-called EDCM distributions to obtain insights into the properties of DCM distributions, and then derive an algorithm for EDCM maximum-likelihood training that is many times faster than the corresponding method for DCM distributions. Next, we investigate expectation-maximization with EDCM components and deterministic annealing as a new clustering algorithm for documents. Experiments show that the new algorithm is competitive with the best methods in the literature, and superior from the point of view of finding models with low perplexity.