Investigating task performance of probabilistic topic models: an empirical study of PLSA and LDA

Authors:
Yue Lu; Qiaozhu Mei; Chengxiang Zhai
Affiliations:
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA; School of Information, University of Michigan, Ann Arbor, MI 48109, USA; Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
Venue:
Information Retrieval
Year:
2011

Abstract

Probabilistic topic models have recently attracted much attention because of their successful application to many text mining tasks, such as retrieval, summarization, categorization, and clustering. Although many studies have reported promising performance for these models, none has systematically investigated their task performance. As a result, several critical questions that may affect every application of topic models remain largely unanswered: how to choose between competing models, how multiple local maxima affect task performance, and how to set the models' parameters. In this paper, we address these questions through a systematic investigation of two representative probabilistic topic models, Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA), on three representative text mining tasks: document clustering, text categorization, and ad hoc retrieval. The analysis of our experimental results provides a deeper understanding of topic models and many useful insights into how to optimize their performance on these typical tasks. The task-based evaluation framework generalizes to other topic models in the PLSA or LDA family.