Fast collapsed Gibbs sampling for latent Dirichlet allocation

  • Authors:
  • Ian Porteous; David Newman; Alexander Ihler; Arthur Asuncion; Padhraic Smyth; Max Welling

  • Affiliations:
  • University of California Irvine, Irvine, CA, USA (all authors)

  • Venue:
  • Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
  • Year:
  • 2008

Abstract

In this paper we introduce a novel collapsed Gibbs sampling method for the widely used latent Dirichlet allocation (LDA) model. Our new method results in significant speedups on real-world text corpora. Conventional Gibbs sampling schemes for LDA require O(K) operations per sample, where K is the number of topics in the model. Our proposed method draws equivalent samples but requires, on average, significantly fewer than K operations per sample. On real-world corpora, FastLDA can be as much as 8 times faster than the standard collapsed Gibbs sampler for LDA. No approximations are necessary, and we show that our fast sampling scheme produces exactly the same results as the standard (but slower) sampling scheme. Experiments on four real-world data sets demonstrate speedups for a wide range of collection sizes. For the PubMed collection of over 8 million documents, which requires 6 CPU months of computation for LDA, our speedup of 5.7 saves 5 CPU months of computation.
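
For context, the sketch below illustrates the conventional collapsed Gibbs update that the abstract describes as costing O(K) per sample: the unnormalized conditional probability of every topic is evaluated before one topic is drawn. This is a minimal illustration of the baseline sampler only, not the paper's FastLDA algorithm; the count-array names (n_wk, n_dk, n_k) and hyperparameter names (alpha, beta) are assumed for illustration and do not come from the paper.

```python
import numpy as np

def sample_topic(w, d, n_wk, n_dk, n_k, alpha, beta, rng):
    """One conventional collapsed Gibbs update for a single token.

    w, d  : word id and document id of the token (its current topic
            assignment is assumed to have already been decremented
            from the counts)
    n_wk  : (V, K) word-topic counts
    n_dk  : (D, K) document-topic counts
    n_k   : (K,)   per-topic total counts

    Cost is O(K): the unnormalized probability of every topic is
    computed before sampling.
    """
    V, K = n_wk.shape
    # p(z = k | rest) ∝ (n_dk[d, k] + alpha) * (n_wk[w, k] + beta) / (n_k[k] + V * beta)
    p = (n_dk[d] + alpha) * (n_wk[w] + beta) / (n_k + V * beta)
    return rng.choice(K, p=p / p.sum())

# Example usage with small random counts (illustrative only).
rng = np.random.default_rng(0)
V, D, K = 1000, 50, 20
n_wk = rng.integers(0, 5, (V, K)).astype(float)
n_dk = rng.integers(0, 5, (D, K)).astype(float)
n_k = n_wk.sum(axis=0)
z = sample_topic(w=3, d=7, n_wk=n_wk, n_dk=n_dk, n_k=n_k,
                 alpha=0.1, beta=0.01, rng=rng)
```

Per the abstract, FastLDA draws a sample from exactly this same conditional distribution while, on average, examining significantly fewer than K topics, which is what yields the reported speedups.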