Distributed Algorithms for Topic Models

Authors:
David Newman; Arthur Asuncion; Padhraic Smyth; Max Welling
Venue:
The Journal of Machine Learning Research
Year:
2009

Abstract

We describe distributed algorithms for two widely used topic models: the Latent Dirichlet Allocation (LDA) model and the Hierarchical Dirichlet Process (HDP) model. In our distributed algorithms, the data is partitioned across separate processors and inference is done in a parallel, distributed fashion. We propose two distributed algorithms for LDA. The first is a straightforward mapping of LDA to a distributed processor setting: processors concurrently perform Gibbs sampling over local data, followed by a global update of topic counts. This algorithm is simple to implement and can be viewed as an approximation to Gibbs-sampled LDA. The second uses a hierarchical Bayesian extension of LDA to directly account for distributed data. This model has a theoretical guarantee of convergence but is more complex to implement than the first algorithm. Our distributed algorithm for HDP takes the straightforward mapping approach and merges newly created topics either by matching or by topic-id. Using five real-world text corpora, we show that distributed learning works well in practice. For both LDA and HDP, we show that the converged test-data log probability for distributed learning is indistinguishable from that obtained with single-processor learning. Our extensive experimental results include learning topic models for two multi-million document collections using a 1024-processor parallel computer.
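
The first LDA algorithm in the abstract (concurrent local Gibbs sampling followed by a global update of topic counts) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The "processors" are simulated one after another in a single process, and the function names, the hyperparameters `alpha` and `beta`, the round-robin document partitioning, and the merge rule (adding each processor's change in word-topic counts to the shared table) are all assumptions made here for illustration.

```python
import random


def init_state(docs, num_topics, vocab_size, rng):
    """Assign a random topic to every token and build the count tables."""
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [[0] * num_topics for _ in docs]
    word_topic = [[0] * num_topics for _ in range(vocab_size)]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            doc_topic[d][z[d][i]] += 1
            word_topic[w][z[d][i]] += 1
    return z, doc_topic, word_topic


def local_gibbs_sweep(doc_ids, docs, z, doc_topic, word_topic, alpha, beta, rng):
    """One collapsed Gibbs sweep over one processor's documents.

    The processor works on a private copy of the word-topic counts, so it
    never sees changes made by the other processors during this sweep.
    """
    vocab_size, num_topics = len(word_topic), len(word_topic[0])
    local_wt = [row[:] for row in word_topic]
    topic_totals = [sum(row[k] for row in local_wt) for k in range(num_topics)]
    for d in doc_ids:
        for i, w in enumerate(docs[d]):
            old = z[d][i]
            # Remove the token's current assignment from the counts.
            doc_topic[d][old] -= 1
            local_wt[w][old] -= 1
            topic_totals[old] -= 1
            # Unnormalized conditional probability of each topic.
            weights = [
                (doc_topic[d][k] + alpha) * (local_wt[w][k] + beta)
                / (topic_totals[k] + vocab_size * beta)
                for k in range(num_topics)
            ]
            new = rng.choices(range(num_topics), weights=weights)[0]
            z[d][i] = new
            doc_topic[d][new] += 1
            local_wt[w][new] += 1
            topic_totals[new] += 1
    return local_wt


def distributed_lda(docs, num_topics, vocab_size, num_procs, sweeps,
                    alpha=0.1, beta=0.01, seed=0):
    """Alternate local sweeps with a global update of word-topic counts."""
    rng = random.Random(seed)
    z, doc_topic, word_topic = init_state(docs, num_topics, vocab_size, rng)
    # Hypothetical partitioning: documents are dealt out round-robin.
    partitions = [list(range(p, len(docs), num_procs)) for p in range(num_procs)]
    for _ in range(sweeps):
        # Every simulated processor starts from the same global counts.
        local_tables = [
            local_gibbs_sweep(part, docs, z, doc_topic, word_topic, alpha, beta, rng)
            for part in partitions
        ]
        # Global update: add each processor's change to the shared table.
        merged = [row[:] for row in word_topic]
        for local_wt in local_tables:
            for w in range(vocab_size):
                for k in range(num_topics):
                    merged[w][k] += local_wt[w][k] - word_topic[w][k]
        word_topic = merged
    return z, doc_topic, word_topic
```

For example, `distributed_lda([[0, 1, 2, 1], [2, 3, 3], [0, 0, 4], [1, 4, 4, 2]], num_topics=2, vocab_size=5, num_procs=2, sweeps=5)` runs five rounds over a toy corpus of word ids. Because each processor samples against counts that are stale with respect to the others, the procedure only approximates single-processor Gibbs sampling, which matches the abstract's description of the first algorithm as an approximation.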