This paper describes a high performance sampling architecture for inference of latent topic models on a cluster of workstations. Our system is faster than previous work by over an order of magnitude and it is capable of dealing with hundreds of millions of documents and thousands of topics. The algorithm relies on a novel communication structure, namely the use of a distributed (key, value) storage for synchronizing the sampler state between computers. Our architecture entirely obviates the need for separate computation and synchronization phases. Instead, disk, CPU, and network are used simultaneously to achieve high performance. We show that this architecture is entirely general and that it can be extended easily to more sophisticated latent variable models such as n-grams and hierarchies.
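The idea of synchronizing sampler state through a (key, value) store, with no separate computation and synchronization phases, can be illustrated with a small sketch. This is not the system described above: it is a single-process toy in which `KVStore`, `Worker`, and `run_demo` are hypothetical names, a lock-protected dictionary stands in for the distributed store, and simple count increments stand in for the sampler's updates. It only shows the pattern of a sampler thread updating local counts while a background thread pushes deltas and pulls fresh global values at the same time.

```python
import threading
from collections import defaultdict


class KVStore:
    """In-process stand-in for a distributed (key, value) store (hypothetical)."""

    def __init__(self):
        self._data = defaultdict(int)
        self._lock = threading.Lock()

    def add(self, key, delta):
        # Atomically apply a count delta and return the new global value.
        with self._lock:
            self._data[key] += delta
            return self._data[key]

    def get(self, key):
        with self._lock:
            return self._data[key]


class Worker:
    """Holds a local copy of the shared counts plus deltas not yet sent."""

    def __init__(self, store):
        self.store = store
        self.local = defaultdict(int)    # this worker's view of the global counts
        self.pending = defaultdict(int)  # local changes not yet pushed to the store
        self._lock = threading.Lock()

    def update(self, key, delta):
        # Called by the sampler thread: touches only local state.
        with self._lock:
            self.local[key] += delta
            self.pending[key] += delta

    def sync_once(self):
        # Called by the background thread: push deltas, pull global values.
        with self._lock:
            outgoing, self.pending = self.pending, defaultdict(int)
            keys = list(self.local)
        for key in keys:
            global_value = self.store.add(key, outgoing.get(key, 0))
            with self._lock:
                # Global value plus whatever accumulated locally in the meantime.
                self.local[key] = global_value + self.pending[key]


def run_demo(num_workers=2, updates_per_worker=1000):
    store = KVStore()
    workers = [Worker(store) for _ in range(num_workers)]
    stop = threading.Event()

    def sample(worker):
        # Stand-in for a sampler: increments one of four counters per step.
        for i in range(updates_per_worker):
            worker.update(("topic", i % 4), 1)

    def sync(worker):
        # Runs concurrently with sampling until asked to stop.
        while not stop.wait(0.001):
            worker.sync_once()

    samplers = [threading.Thread(target=sample, args=(w,)) for w in workers]
    syncers = [threading.Thread(target=sync, args=(w,)) for w in workers]
    for t in samplers + syncers:
        t.start()
    for t in samplers:
        t.join()
    stop.set()
    for t in syncers:
        t.join()
    # First pass flushes remaining deltas; second pass pulls the final values.
    for _ in range(2):
        for w in workers:
            w.sync_once()
    return store, workers
```

Calling `run_demo()` returns the store and the workers after all updates have been flushed; at that point each worker's local counts agree with the store. In a real deployment the store would be a networked service and the local view would be allowed to lag behind it during sampling.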