Mr. LDA: a flexible large scale topic modeling package using variational inference in MapReduce

  • Authors: Ke Zhai, Jordan Boyd-Graber, Nima Asadi, Mohamad L. Alkhouja
  • Affiliations: University of Maryland, College Park, MD, USA (all authors)
  • Venue: Proceedings of the 21st International Conference on World Wide Web
  • Year: 2012

Abstract

Latent Dirichlet Allocation (LDA) is a popular topic modeling technique for exploring document collections. Because of the increasing prevalence of large datasets, there is a need to improve the scalability of inference for LDA. In this paper, we introduce a novel and flexible large-scale topic modeling package in MapReduce (Mr. LDA). As opposed to other techniques that use Gibbs sampling, our proposed framework uses variational inference, which easily fits into a distributed environment. More importantly, this variational implementation, unlike highly tuned and specialized implementations based on Gibbs sampling, is easily extensible. We demonstrate two extensions of the model possible with this scalable framework: informed priors to guide topic discovery and extracting topics from a multilingual corpus. We compare the scalability of Mr. LDA against Mahout, an existing large-scale topic modeling package. Mr. LDA outperforms Mahout both in execution speed and held-out likelihood.
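
The computational pattern the abstract describes, distributing variational inference for LDA over MapReduce, can be illustrated with a small, self-contained sketch. The Java program below is an illustrative assumption of mine, not the authors' Mr. LDA code; the class and method names (VariationalLdaSketch, mapDocument, reduceTopicWordCounts) and the toy corpus and hyperparameters are hypothetical. It mimics the split that makes variational inference map-reduce friendly: a "mapper" performs the per-document variational updates of the local parameters (phi, gamma), and a "reducer" aggregates the emitted topic-word sufficient statistics into new topic parameters.

```java
import java.util.*;

/**
 * Minimal sketch of the computation Mr. LDA distributes with MapReduce
 * (not the authors' code): each "mapper" call handles one document's
 * variational updates, and the "reducer" aggregates topic-word statistics.
 * Corpus, vocabulary size, and hyperparameters are illustrative.
 */
public class VariationalLdaSketch {
    static final int K = 2;          // number of topics (assumed)
    static final int V = 6;          // vocabulary size (assumed)
    static final double ALPHA = 0.1; // symmetric Dirichlet prior on theta

    // Digamma via recurrence plus an asymptotic expansion; adequate here.
    static double digamma(double x) {
        double r = 0.0;
        while (x < 6.0) { r -= 1.0 / x; x += 1.0; }
        double f = 1.0 / (x * x);
        return r + Math.log(x) - 0.5 / x
                - f * (1.0 / 12 - f * (1.0 / 120 - f / 252));
    }

    // "Mapper": per-document updates of phi and gamma; returns the
    // phi-weighted topic-word counts this document would emit.
    static double[][] mapDocument(int[] words, double[][] beta) {
        double[] gamma = new double[K];
        Arrays.fill(gamma, ALPHA + (double) words.length / K);
        double[][] emitted = new double[K][V];
        for (int iter = 0; iter < 20; iter++) {
            double[] newGamma = new double[K];
            Arrays.fill(newGamma, ALPHA);
            for (int word : words) {
                double[] phi = new double[K];
                double norm = 0.0;
                for (int k = 0; k < K; k++) {
                    phi[k] = beta[k][word] * Math.exp(digamma(gamma[k]));
                    norm += phi[k];
                }
                for (int k = 0; k < K; k++) {
                    phi[k] /= norm;
                    newGamma[k] += phi[k];
                    if (iter == 19) emitted[k][word] += phi[k]; // emit last pass
                }
            }
            gamma = newGamma;
        }
        return emitted;
    }

    // "Reducer": sum per-document statistics and renormalize each topic.
    static double[][] reduceTopicWordCounts(List<double[][]> perDocStats) {
        double[][] beta = new double[K][V];
        for (double[][] stats : perDocStats)
            for (int k = 0; k < K; k++)
                for (int v = 0; v < V; v++) beta[k][v] += stats[k][v];
        for (int k = 0; k < K; k++) {
            double sum = Arrays.stream(beta[k]).sum();
            for (int v = 0; v < V; v++) beta[k][v] /= sum;
        }
        return beta;
    }

    public static void main(String[] args) {
        // Toy corpus: documents are arrays of word ids in [0, V).
        int[][] corpus = {{0, 1, 2, 0}, {3, 4, 5, 3}, {0, 2, 1}, {4, 5, 3}};
        double[][] beta = new double[K][V];
        Random rng = new Random(0);
        for (int k = 0; k < K; k++) {
            for (int v = 0; v < V; v++) beta[k][v] = 0.5 + rng.nextDouble();
            double s = Arrays.stream(beta[k]).sum();
            for (int v = 0; v < V; v++) beta[k][v] /= s;
        }
        for (int iteration = 0; iteration < 10; iteration++) {
            List<double[][]> stats = new ArrayList<>();
            for (int[] doc : corpus) stats.add(mapDocument(doc, beta));
            beta = reduceTopicWordCounts(stats); // M-step over reducer output
        }
        System.out.println(Arrays.deepToString(beta));
    }
}
```

In a Hadoop deployment these two steps would presumably correspond to Mapper and Reducer classes, with a driver iterating map and reduce passes until convergence; the purely local, deterministic nature of the per-document update is what lets the variational approach fit a distributed environment so naturally, as the abstract notes.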