An architecture for parallel topic models

Authors:
Alexander Smola; Shravan Narayanamurthy
Affiliations:
Yahoo! Research, Santa Clara, CA, and Australian National University, Canberra; Yahoo! Labs, Bangalore, India
Venue:
Proceedings of the VLDB Endowment
Year:
2010

Citing 5
Cited 37

Latent dirichlet allocation

The Journal of Machine Learning Research
Convex Optimization

Convex Optimization
Efficient methods for topic model inference on streaming document collections

Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
PLDA: Parallel Latent Dirichlet Allocation for Large-Scale Applications

AAIM '09 Proceedings of the 5th International Conference on Algorithmic Aspects in Information and Management
The generalized distributive law

IEEE Transactions on Information Theory

Investigating topic models for social media user recommendation

Proceedings of the 20th international conference companion on World wide web
Understanding the functions of business accounts on Twitter

Proceedings of the 20th international conference companion on World wide web
Regularized latent semantic indexing

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Scalable distributed inference of dynamic user interests for behavioral targeting

Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Multiple domain user personalization

Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Democrats, republicans and starbucks afficionados: user classification in twitter

Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Conditional topical coding: an efficient topic model conditioned on rich features

Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Latent topic feedback for information retrieval

Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Larger residuals, less work: active document scheduling for latent dirichlet allocation

ECML PKDD'11 Proceedings of the 2011 European conference on Machine learning and knowledge discovery in databases - Volume Part III
Scalable inference in latent variable models

Proceedings of the fifth ACM international conference on Web search and data mining
Collective context-aware topic models for entity disambiguation

Proceedings of the 21st international conference on World Wide Web
Mr. LDA: a flexible large scale topic modeling package using variational inference in MapReduce

Proceedings of the 21st international conference on World Wide Web
Large scale microblog mining using distributed MB-LDA

Proceedings of the 21st international conference companion on World Wide Web
Distributed GraphLab: a framework for machine learning and data mining in the cloud

Proceedings of the VLDB Endowment
Large-scale machine learning at twitter

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Linear support vector machines via dual cached loops

Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
Large-scale distributed non-negative sparse coding and sparse dictionary learning

Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
Transparent user models for personalization

Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
ComSoc: adaptive transfer of user behaviors over composite social network

Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
Social-network analysis using topic models

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
PowerGraph: distributed graph-parallel computation on natural graphs

OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Efficient tree-based topic modeling

ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2
Web-scale multi-task feature selection for behavioral targeting

Proceedings of the 21st ACM international conference on Information and knowledge management
Fully sparse topic models

ECML PKDD'12 Proceedings of the 2012 European conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I
Regularized Latent Semantic Indexing: A New Approach to Large-Scale Topic Modeling

ACM Transactions on Information Systems (TOIS)
Towards high-throughput gibbs sampling at scale: a study across storage managers

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Simulation of database-valued markov chains using SimSQL

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Scalable inference in max-margin topic models

Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
Big data analytics with small footprint: squaring the cloud

Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
Stochastic collapsed variational Bayesian inference for latent Dirichlet allocation

Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
Distributed large-scale natural graph factorization

Proceedings of the 22nd international conference on World Wide Web
Stochastic variational inference

The Journal of Machine Learning Research
Generalized relational topic models with data augmentation

IJCAI'13 Proceedings of the Twenty-Third international joint conference on Artificial Intelligence
Scalable dynamic nonparametric Bayesian models of content and users

IJCAI'13 Proceedings of the Twenty-Third international joint conference on Artificial Intelligence
Scalable topic-specific influence analysis on microblogs

Proceedings of the 7th ACM international conference on Web search and data mining
Scalable hierarchical multitask learning algorithms for conversion optimization in display advertising

Proceedings of the 7th ACM international conference on Web search and data mining
User behavior learning and transfer in composite social networks

ACM Transactions on Knowledge Discovery from Data (TKDD) - Casin special issue

Abstract

This paper describes a high-performance sampling architecture for inference in latent topic models on a cluster of workstations. Our system is more than an order of magnitude faster than previous work and can handle hundreds of millions of documents and thousands of topics. The algorithm relies on a novel communication structure: a distributed (key, value) store that synchronizes the sampler state between computers. Our architecture entirely obviates the need for separate computation and synchronization phases. Instead, disk, CPU, and network are used simultaneously to achieve high performance. We show that this architecture is entirely general and that it can be extended easily to more sophisticated latent variable models such as n-grams and hierarchies.
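
The idea of synchronizing sampler state through a (key, value) store, with no separate computation and synchronization phases, can be illustrated with a small sketch. This is not the authors' implementation. It assumes a collapsed Gibbs sampler for LDA as the topic model, uses threads in one process in place of cluster machines, and uses an in-memory dictionary in place of the distributed store. All names (`KVStore`, `Worker`, `run_cluster`) and the hyperparameters `alpha` and `beta` are hypothetical.

```python
import random
import threading
from collections import defaultdict


class KVStore:
    """In-memory stand-in for a distributed (key, value) store (assumption)."""

    def __init__(self):
        self._data = defaultdict(int)
        self._lock = threading.Lock()

    def add(self, key, delta):
        # Atomically apply a count delta and return the updated global value.
        with self._lock:
            self._data[key] += delta
            return self._data[key]

    def snapshot(self):
        with self._lock:
            return dict(self._data)


class Worker:
    """One simulated machine: samples its documents while a thread syncs."""

    def __init__(self, docs, num_topics, vocab_size, store, seed,
                 alpha=0.1, beta=0.01):
        self.docs, self.K, self.V, self.store = docs, num_topics, vocab_size, store
        self.alpha, self.beta = alpha, beta
        self.rng = random.Random(seed)
        self.lock = threading.Lock()
        self.counts = defaultdict(int)  # local view of the global counts
        self.delta = defaultdict(int)   # local changes not yet pushed
        self.ndt = [[0] * num_topics for _ in docs]  # per-document topic counts
        self.z = []                     # topic assignment of every token
        for d, doc in enumerate(docs):
            self.z.append([self.rng.randrange(num_topics) for _ in doc])
            for w, k in zip(doc, self.z[d]):
                self.ndt[d][k] += 1
                self._bump(w, k, +1)

    def _bump(self, w, k, d):
        # Track the (word, topic) count and the per-topic total ("*", topic).
        for key in ((w, k), ("*", k)):
            self.counts[key] += d
            self.delta[key] += d

    def _resample(self, d, i, w):
        old = self.z[d][i]
        self.ndt[d][old] -= 1
        self._bump(w, old, -1)
        weights = [(self.ndt[d][k] + self.alpha)
                   * (self.counts[(w, k)] + self.beta)
                   / (self.counts[("*", k)] + self.V * self.beta)
                   for k in range(self.K)]
        new = self.rng.choices(range(self.K), weights)[0]
        self.z[d][i] = new
        self.ndt[d][new] += 1
        self._bump(w, new, +1)

    def sample(self, sweeps):
        for _ in range(sweeps):
            for d, doc in enumerate(self.docs):
                for i, w in enumerate(doc):
                    with self.lock:
                        self._resample(d, i, w)

    def sync(self):
        # Swap out pending deltas, push them, and fold global values back in.
        with self.lock:
            pending, self.delta = self.delta, defaultdict(int)
        for key, d in pending.items():
            g = self.store.add(key, d)
            with self.lock:
                self.counts[key] = g + self.delta.get(key, 0)

    def run(self, sweeps):
        stop = threading.Event()

        def sync_loop():
            while not stop.wait(0.001):
                self.sync()

        syncer = threading.Thread(target=sync_loop)
        syncer.start()
        self.sample(sweeps)  # sampling overlaps with synchronization
        stop.set()
        syncer.join()
        self.sync()          # flush whatever is still pending


def run_cluster(shards, num_topics, vocab_size, sweeps=5):
    store = KVStore()
    workers = [Worker(s, num_topics, vocab_size, store, seed=i)
               for i, s in enumerate(shards)]
    threads = [threading.Thread(target=w.run, args=(sweeps,)) for w in workers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store.snapshot()
```

Calling `run_cluster([[[0, 1, 2, 1], [3, 4]], [[0, 0, 5], [2, 3, 4, 5]]], num_topics=3, vocab_size=6)` returns the final global counts. Each worker samples against a slightly stale local view and only exchanges count deltas, so no worker ever waits at a barrier. Once every worker has flushed its last deltas, the counts in the store sum to the number of tokens in the corpus.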