Parallelized Variational EM for Latent Dirichlet Allocation: An Experimental Evaluation of Speed and Scalability

  • Authors:
  • Ramesh Nallapati; William Cohen; John Lafferty

  • Venue:
  • ICDMW '07 Proceedings of the Seventh IEEE International Conference on Data Mining Workshops
  • Year:
  • 2007

Abstract

Statistical topic models such as Latent Dirichlet Allocation (LDA) have emerged as an attractive framework to model, visualize and summarize large document collections in a completely unsupervised fashion. Considering the enormous sizes of modern electronic document collections, it is very important that these models are fast and scalable. In this work, we build parallel implementations of the variational EM algorithm for LDA in a multiprocessor architecture as well as a distributed setting. Our experiments on document collections of various sizes indicate that while both implementations achieve speed-ups, the distributed version achieves dramatic improvements in both speed and scalability. We also analyze the costs associated with various stages of the EM algorithm and suggest ways to further improve the performance.
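
The abstract does not give implementation details, so the following is only a minimal sketch of the general idea behind parallelizing variational EM for LDA, not the authors' code. It assumes the standard mean-field updates for LDA: the E-step runs independently per document, so documents can be split into shards, each worker returns topic-word sufficient statistics for its shard, and the M-step sums them. All names here (`e_step_shard`, `m_step`, `parallel_em`), the fixed `alpha`, the fixed iteration counts, and the use of a thread pool are assumptions made for illustration; Python threads show the data flow but would not give the speed-ups reported in the paper.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def digamma(x):
    """Pure-Python digamma: recurrence up to x >= 6, then asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - f * (1 / 12 - f * (1 / 120 - f / 252))


def e_step_shard(docs, beta, alpha, n_iter=20):
    """Variational E-step on one shard; returns sufficient statistics ss[k][w].

    Each doc is a dict {word_id: count}. alpha is a fixed symmetric prior
    (an assumption; it is not re-estimated in this sketch).
    """
    K, V = len(beta), len(beta[0])
    ss = [[0.0] * V for _ in range(K)]
    for doc in docs:
        total = sum(doc.values())
        gamma = [alpha + total / K] * K
        phi = {}
        for _ in range(n_iter):
            new_gamma = [alpha] * K
            for w, c in doc.items():
                # phi[w][k] is proportional to beta[k][w] * exp(digamma(gamma[k]))
                weights = [beta[k][w] * math.exp(digamma(gamma[k])) for k in range(K)]
                norm = sum(weights)
                phi[w] = [x / norm for x in weights]
                for k in range(K):
                    new_gamma[k] += c * phi[w][k]
            gamma = new_gamma
        for w, c in doc.items():
            for k in range(K):
                ss[k][w] += c * phi[w][k]
    return ss


def m_step(shard_stats, K, V, eps=1e-12):
    """Sum per-shard statistics and renormalize each topic row.

    eps is a tiny smoothing constant (an assumption) that avoids zero rows.
    """
    beta = []
    for k in range(K):
        row = [sum(ss[k][w] for ss in shard_stats) + eps for w in range(V)]
        total = sum(row)
        beta.append([x / total for x in row])
    return beta


def parallel_em(docs, K, V, alpha=0.1, n_workers=2, n_em_iter=5, seed=0):
    """Run a few EM iterations with the E-step split across n_workers shards."""
    rng = random.Random(seed)
    beta = []
    for _ in range(K):
        row = [rng.random() + 0.1 for _ in range(V)]
        total = sum(row)
        beta.append([x / total for x in row])
    shards = [docs[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for _ in range(n_em_iter):
            current = beta  # bind the current parameters for this iteration
            stats = list(pool.map(lambda s: e_step_shard(s, current, alpha), shards))
            beta = m_step(stats, K, V)
    return beta
```

Because the sufficient statistics are sums over documents, the result should not depend on how the documents are sharded, up to floating-point rounding. In a distributed setting the same structure would apply, with the per-shard statistics sent over the network to a node that performs the M-step.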