Topic models provide a powerful tool for analyzing large text collections by representing high-dimensional data in a low-dimensional subspace. Fitting a topic model given a set of training documents requires approximate inference techniques that are computationally expensive. With today's large-scale, constantly expanding document collections, it is useful to be able to infer topic distributions for new documents without retraining the model. In this paper, we empirically evaluate the performance of several methods for topic inference in previously unseen documents, including methods based on Gibbs sampling, variational inference, and a new method inspired by text classification. The classification-based inference method produces results similar to those of iterative inference methods, but requires only a single matrix multiplication. In addition to these inference methods, we present SparseLDA, an algorithm and data structure for evaluating Gibbs sampling distributions. Empirical results indicate that SparseLDA can be approximately 20 times faster than traditional LDA and provide twice the speedup of previously published fast sampling methods, while also using substantially less memory.
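To make the classification-style inference concrete, the sketch below (Python/NumPy) is a minimal illustration, not the paper's implementation: it assumes the trained model is summarized by a topic-word probability matrix (named `topic_word` here; the name, the prior parameter, and the toy numbers are illustrative assumptions), and estimates a new document's topic distribution with a single matrix-vector product followed by normalization.

```python
import numpy as np

def infer_topics(doc_word_counts, topic_word, doc_topic_prior=0.0):
    """Estimate a topic distribution for an unseen document with a single
    matrix multiplication, in the spirit of the classification-based
    inference described above.

    doc_word_counts : (V,) array of word counts for the new document
    topic_word      : (K, V) matrix of per-topic word probabilities
                      taken from a trained model (illustrative name)
    doc_topic_prior : optional smoothing added to every topic score
    """
    # One matrix-vector product yields an unnormalized score per topic.
    scores = topic_word @ doc_word_counts + doc_topic_prior
    # Normalize the scores to obtain a distribution over topics.
    return scores / scores.sum()

# Toy example: 3 topics over a 5-word vocabulary (numbers are assumptions).
topic_word = np.array([
    [0.4, 0.3, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.3, 0.1],
    [0.1, 0.1, 0.1, 0.2, 0.5],
])
doc = np.array([3, 2, 0, 0, 1])       # word counts of a new document
print(infer_topics(doc, topic_word))  # -> approximately [0.54, 0.17, 0.29]
```

The appeal of this style of inference is that, once the matrix is precomputed from the trained model, scoring a new document costs one multiplication rather than repeated sampling or variational iterations.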