Fast collapsed Gibbs sampling for latent Dirichlet allocation

  • Authors:
  • Ian Porteous; David Newman; Alexander Ihler; Arthur Asuncion; Padhraic Smyth; Max Welling

  • Affiliations:
  • University of California Irvine, Irvine, CA, USA (all authors)

  • Venue:
  • Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
  • Year:
  • 2008

Abstract

In this paper we introduce a novel collapsed Gibbs sampling method for the widely used latent Dirichlet allocation (LDA) model. Our new method results in significant speedups on real-world text corpora. Conventional Gibbs sampling schemes for LDA require O(K) operations per sample, where K is the number of topics in the model. Our proposed method draws equivalent samples but requires, on average, significantly fewer than K operations per sample. On real-world corpora, FastLDA can be as much as 8 times faster than the standard collapsed Gibbs sampler for LDA. No approximations are necessary, and we show that our fast sampling scheme produces exactly the same results as the standard (but slower) sampling scheme. Experiments on four real-world data sets demonstrate speedups for a wide range of collection sizes. For the PubMed collection of over 8 million documents, which requires 6 CPU months of computation for LDA, our speedup of 5.7 saves 5 CPU months of computation.
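
For context, the sketch below illustrates the conventional collapsed Gibbs update that the abstract describes as costing O(K) per sample: the unnormalized conditional probability of every topic is evaluated before one topic is drawn. This is a minimal illustration of the baseline sampler only, not the paper's FastLDA algorithm; the count-array names (n_wk, n_dk, n_k) and hyperparameter names (alpha, beta) are assumed for illustration and do not come from the paper.

```python
import numpy as np

def sample_topic(w, d, n_wk, n_dk, n_k, alpha, beta, rng):
    """One conventional collapsed Gibbs update for a single token.

    w, d  : word id and document id of the token (its current topic
            assignment is assumed to have already been decremented
            from the counts)
    n_wk  : (V, K) word-topic counts
    n_dk  : (D, K) document-topic counts
    n_k   : (K,)   per-topic total counts

    Cost is O(K): the unnormalized probability of every topic is
    computed before sampling.
    """
    V, K = n_wk.shape
    # p(z = k | rest) ∝ (n_dk[d, k] + alpha) * (n_wk[w, k] + beta) / (n_k[k] + V * beta)
    p = (n_dk[d] + alpha) * (n_wk[w] + beta) / (n_k + V * beta)
    return rng.choice(K, p=p / p.sum())

# Example usage with small random counts (illustrative only).
rng = np.random.default_rng(0)
V, D, K = 1000, 50, 20
n_wk = rng.integers(0, 5, (V, K)).astype(float)
n_dk = rng.integers(0, 5, (D, K)).astype(float)
n_k = n_wk.sum(axis=0)
z = sample_topic(w=3, d=7, n_wk=n_wk, n_dk=n_dk, n_k=n_k,
                 alpha=0.1, beta=0.01, rng=rng)
```

Per the abstract, FastLDA draws a sample from exactly this same conditional distribution while, on average, examining significantly fewer than K topics, which is what yields the reported speedups.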