Mr. LDA: a flexible large scale topic modeling package using variational inference in MapReduce

  • Authors: Ke Zhai, Jordan Boyd-Graber, Nima Asadi, Mohamad L. Alkhouja
  • Affiliations: University of Maryland, College Park, MD, USA (all authors)
  • Venue: Proceedings of the 21st International Conference on World Wide Web
  • Year: 2012

Abstract

Latent Dirichlet Allocation (LDA) is a popular topic modeling technique for exploring document collections. Because of the increasing prevalence of large datasets, there is a need to improve the scalability of inference for LDA. In this paper, we introduce a novel and flexible large-scale topic modeling package in MapReduce (Mr. LDA). As opposed to other techniques that use Gibbs sampling, our proposed framework uses variational inference, which easily fits into a distributed environment. More importantly, this variational implementation, unlike highly tuned and specialized implementations based on Gibbs sampling, is easily extensible. We demonstrate two extensions of the model possible with this scalable framework: informed priors to guide topic discovery and extracting topics from a multilingual corpus. We compare the scalability of Mr. LDA against Mahout, an existing large-scale topic modeling package. Mr. LDA outperforms Mahout both in execution speed and held-out likelihood.
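
The computational pattern the abstract describes, distributing variational inference for LDA over MapReduce, can be illustrated with a small, self-contained sketch. The Java program below is an illustrative assumption of mine, not the authors' Mr. LDA code; the class and method names (VariationalLdaSketch, mapDocument, reduceTopicWordCounts) and the toy corpus and hyperparameters are hypothetical. It mimics the split that makes variational inference map-reduce friendly: a "mapper" performs the per-document variational updates of the local parameters (phi, gamma), and a "reducer" aggregates the emitted topic-word sufficient statistics into new topic parameters.

```java
import java.util.*;

/**
 * Minimal sketch of the computation Mr. LDA distributes with MapReduce
 * (not the authors' code): each "mapper" call handles one document's
 * variational updates, and the "reducer" aggregates topic-word statistics.
 * Corpus, vocabulary size, and hyperparameters are illustrative.
 */
public class VariationalLdaSketch {
    static final int K = 2;          // number of topics (assumed)
    static final int V = 6;          // vocabulary size (assumed)
    static final double ALPHA = 0.1; // symmetric Dirichlet prior on theta

    // Digamma via recurrence plus an asymptotic expansion; adequate here.
    static double digamma(double x) {
        double r = 0.0;
        while (x < 6.0) { r -= 1.0 / x; x += 1.0; }
        double f = 1.0 / (x * x);
        return r + Math.log(x) - 0.5 / x
                - f * (1.0 / 12 - f * (1.0 / 120 - f / 252));
    }

    // "Mapper": per-document updates of phi and gamma; returns the
    // phi-weighted topic-word counts this document would emit.
    static double[][] mapDocument(int[] words, double[][] beta) {
        double[] gamma = new double[K];
        Arrays.fill(gamma, ALPHA + (double) words.length / K);
        double[][] emitted = new double[K][V];
        for (int iter = 0; iter < 20; iter++) {
            double[] newGamma = new double[K];
            Arrays.fill(newGamma, ALPHA);
            for (int word : words) {
                double[] phi = new double[K];
                double norm = 0.0;
                for (int k = 0; k < K; k++) {
                    phi[k] = beta[k][word] * Math.exp(digamma(gamma[k]));
                    norm += phi[k];
                }
                for (int k = 0; k < K; k++) {
                    phi[k] /= norm;
                    newGamma[k] += phi[k];
                    if (iter == 19) emitted[k][word] += phi[k]; // emit last pass
                }
            }
            gamma = newGamma;
        }
        return emitted;
    }

    // "Reducer": sum per-document statistics and renormalize each topic.
    static double[][] reduceTopicWordCounts(List<double[][]> perDocStats) {
        double[][] beta = new double[K][V];
        for (double[][] stats : perDocStats)
            for (int k = 0; k < K; k++)
                for (int v = 0; v < V; v++) beta[k][v] += stats[k][v];
        for (int k = 0; k < K; k++) {
            double sum = Arrays.stream(beta[k]).sum();
            for (int v = 0; v < V; v++) beta[k][v] /= sum;
        }
        return beta;
    }

    public static void main(String[] args) {
        // Toy corpus: documents are arrays of word ids in [0, V).
        int[][] corpus = {{0, 1, 2, 0}, {3, 4, 5, 3}, {0, 2, 1}, {4, 5, 3}};
        double[][] beta = new double[K][V];
        Random rng = new Random(0);
        for (int k = 0; k < K; k++) {
            for (int v = 0; v < V; v++) beta[k][v] = 0.5 + rng.nextDouble();
            double s = Arrays.stream(beta[k]).sum();
            for (int v = 0; v < V; v++) beta[k][v] /= s;
        }
        for (int iteration = 0; iteration < 10; iteration++) {
            List<double[][]> stats = new ArrayList<>();
            for (int[] doc : corpus) stats.add(mapDocument(doc, beta));
            beta = reduceTopicWordCounts(stats); // M-step over reducer output
        }
        System.out.println(Arrays.deepToString(beta));
    }
}
```

In a Hadoop deployment these two steps would presumably correspond to Mapper and Reducer classes, with a driver iterating map and reduce passes until convergence; the purely local, deterministic nature of the per-document update is what lets the variational approach fit a distributed environment so naturally, as the abstract notes.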