Accounting for burstiness in topic models

Authors:
Gabriel Doyle; Charles Elkan
Affiliations:
University of California, San Diego, La Jolla, CA; University of California, San Diego, La Jolla, CA
Venue:
ICML '09 Proceedings of the 26th Annual International Conference on Machine Learning
Year:
2009

Abstract

Many different topic models have been used successfully for a variety of applications. However, even state-of-the-art topic models suffer from the important flaw that they do not capture the tendency of words to appear in bursts; it is a fundamental property of language that if a word is used once in a document, it is more likely to be used again. We introduce a topic model that uses Dirichlet compound multinomial (DCM) distributions to model this burstiness phenomenon. On both text and non-text datasets, the new model achieves better held-out likelihood than standard latent Dirichlet allocation (LDA). It is straightforward to incorporate the DCM extension into topic models that are more complex than LDA.
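The burstiness claim can be made concrete with a small numerical sketch. The abstract gives no formulas, so the code below uses the standard DCM (Polya) probability mass function for a word-count vector and compares it with a plain multinomial; the function names, the four-word vocabulary, and the parameter values (0.1 per word for the DCM, uniform 0.25 for the multinomial) are illustrative assumptions, not values from the paper. It is not the paper's topic model, only the single-distribution building block the abstract refers to.

```python
from math import lgamma, log

def dcm_log_prob(counts, alpha):
    """Log-probability of a count vector under a DCM with parameters alpha.

    Standard form: n!/prod(x_w!) * Gamma(A)/Gamma(n+A)
                   * prod(Gamma(x_w+alpha_w)/Gamma(alpha_w)),
    where n = sum(counts) and A = sum(alpha).
    """
    n = sum(counts)
    a = sum(alpha)
    log_coef = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    log_norm = lgamma(a) - lgamma(n + a)
    log_terms = sum(lgamma(x + al) - lgamma(al) for x, al in zip(counts, alpha))
    return log_coef + log_norm + log_terms

def multinomial_log_prob(counts, probs):
    """Log-probability of a count vector under a multinomial with fixed probs."""
    n = sum(counts)
    log_coef = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    return log_coef + sum(x * log(p) for x, p in zip(counts, probs) if x > 0)

def dcm_next_word_prob(counts, alpha, w):
    """Polya-urn predictive probability that the next word is w, given counts so far."""
    return (alpha[w] + counts[w]) / (sum(alpha) + sum(counts))

# Illustrative (assumed) settings: 4-word vocabulary, 4-word documents.
alpha = [0.1, 0.1, 0.1, 0.1]      # small alpha values make the DCM bursty
probs = [0.25, 0.25, 0.25, 0.25]  # multinomial with the same mean proportions

bursty = [4, 0, 0, 0]  # one word repeated four times
spread = [1, 1, 1, 1]  # every word used once

print("DCM         bursty vs spread:", dcm_log_prob(bursty, alpha), dcm_log_prob(spread, alpha))
print("Multinomial bursty vs spread:", multinomial_log_prob(bursty, probs), multinomial_log_prob(spread, probs))
print("P(word 0 next) before and after one use:",
      dcm_next_word_prob([0, 0, 0, 0], alpha, 0),
      dcm_next_word_prob([1, 0, 0, 0], alpha, 0))
```

Under these assumed settings the DCM assigns more probability to the document that repeats one word than to the document that spreads its words evenly, while the multinomial does the opposite. The predictive probability shows the same effect: once a word has been used, the chance of it being used again rises.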