Latent Dirichlet Allocation (LDA) is a popular topic modeling technique for exploring document collections. Because of the increasing prevalence of large datasets, there is a need to improve the scalability of inference for LDA. In this paper, we introduce a novel and flexible large-scale topic modeling package in MapReduce (Mr. LDA). In contrast to other techniques, which use Gibbs sampling, our proposed framework uses variational inference, which fits naturally into a distributed environment. More importantly, this variational implementation, unlike highly tuned and specialized implementations based on Gibbs sampling, is easily extensible. We demonstrate two extensions made possible by this scalable framework: informed priors to guide topic discovery and topic extraction from a multilingual corpus. We compare the scalability of Mr. LDA against Mahout, an existing large-scale topic modeling package. Mr. LDA outperforms Mahout both in execution speed and held-out likelihood.
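The reason variational inference parallelizes so naturally is that the per-document E-step depends only on that document and the current global topic-word parameters, so each document can be processed by an independent mapper, while a reducer sums the resulting sufficient statistics into new topic-word distributions. The sketch below illustrates this split under simplifying assumptions (point-estimate topics rather than a full Dirichlet posterior over them, a hand-rolled digamma); function names like `e_step_mapper` and `m_step_reducer` are illustrative, not the actual Mr. LDA API.

```python
import math
from collections import defaultdict

def digamma(x):
    """Digamma via the recurrence psi(x) = psi(x+1) - 1/x plus an
    asymptotic expansion, valid for x > 0."""
    result = 0.0
    while x < 6:
        result -= 1.0 / x
        x += 1
    inv = 1.0 / x
    inv2 = inv * inv
    return result + math.log(x) - 0.5 * inv - inv2 * (
        1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def e_step_mapper(doc, beta, alpha, n_iter=20):
    """Variational E-step for a single document (the 'map' stage).
    doc: list of (word_id, count); beta[k][w]: topic-word probabilities.
    Returns (gamma, sufficient statistics {(topic, word): expected count})."""
    K = len(beta)
    n_words = sum(c for _, c in doc)
    gamma = [alpha + n_words / K] * K  # variational Dirichlet over topics
    for _ in range(n_iter):
        stats = defaultdict(float)
        new_gamma = [alpha] * K
        for w, c in doc:
            # phi_k proportional to beta[k][w] * exp(digamma(gamma_k))
            phi = [beta[k][w] * math.exp(digamma(gamma[k])) for k in range(K)]
            norm = sum(phi)
            for k in range(K):
                p = phi[k] / norm
                new_gamma[k] += c * p
                stats[(k, w)] += c * p
        gamma = new_gamma
    return gamma, dict(stats)

def m_step_reducer(all_stats, K, V, eta=0.01):
    """Aggregate sufficient statistics from all mappers ('reduce' stage)
    into re-normalized topic-word distributions, with smoothing eta."""
    beta = [[eta] * V for _ in range(K)]
    for stats in all_stats:
        for (k, w), v in stats.items():
            beta[k][w] += v
    for k in range(K):
        z = sum(beta[k])
        beta[k] = [x / z for x in beta[k]]
    return beta
```

Because each mapper emits only compact per-word statistics rather than per-token assignments, the shuffle cost stays proportional to the vocabulary, which is the property that makes the variational E-step a good fit for MapReduce.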