Test Data Likelihood for PLSA Models

Authors: Thorsten Brants
Affiliations: Google, Inc., Mountain View, USA 94043
Venue: Information Retrieval
Year: 2005

Abstract

Probabilistic Latent Semantic Analysis (PLSA) is a statistical latent class model that has recently received considerable attention. In its usual formulation it cannot assign likelihoods to unseen documents. Furthermore, it assigns a probability of zero to unseen documents during training. We point out that one of the two existing alternative formulations of the Expectation-Maximization (EM) algorithm for PLSA does not require this assumption. However, even that formulation does not allow calculation of the actual likelihood values. We therefore derive a new test-data likelihood substitute for PLSA and compare it to three existing likelihood substitutes. An empirical evaluation shows that our new likelihood substitute produces the best predictions about accuracies in two different IR tasks and is therefore best suited to determine the number of EM steps when training PLSA models. The new likelihood measure and its evaluation also suggest that PLSA is not very sensitive to overfitting for the two tasks considered. This renders additions like tempered EM that especially address overfitting unnecessary.
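To make the setting concrete, the following is a minimal sketch of PLSA training with EM in which a held-out score is tracked after every step and used to pick the number of EM steps. It is not the likelihood substitute derived in the paper. The held-out score here is a generic "folding-in" log-likelihood, in which the word distributions P(w|z) are frozen and only the topic mixture of each test document is re-estimated. All function names (`em_step`, `fold_in_score`, `train_plsa`), the toy counts, and the small smoothing floor are assumptions made for illustration.

```python
import math
import random


def normalize(xs):
    s = sum(xs)
    return [x / s for x in xs]


def em_step(counts, p_w_z, p_z_d, update_words=True):
    """One EM step for P(w|d) = sum_z P(w|z) P(z|d).

    counts[d] maps word id -> count; p_w_z[z][w] and p_z_d[d][z] are lists.
    """
    K, V = len(p_w_z), len(p_w_z[0])
    new_w = [[1e-12] * V for _ in range(K)]  # assumed tiny floor, keeps logs finite
    new_z_d = []
    for d, doc in enumerate(counts):
        acc = [1e-12] * K
        for w, n in doc.items():
            # E-step: posterior P(z|d,w) is proportional to P(w|z) P(z|d)
            post = normalize([p_w_z[z][w] * p_z_d[d][z] for z in range(K)])
            for z in range(K):
                acc[z] += n * post[z]
                new_w[z][w] += n * post[z]
        new_z_d.append(normalize(acc))  # M-step for P(z|d)
    if update_words:
        p_w_z = [normalize(row) for row in new_w]  # M-step for P(w|z)
    return p_w_z, new_z_d


def log_likelihood(counts, p_w_z, p_z_d):
    K = len(p_w_z)
    total = 0.0
    for d, doc in enumerate(counts):
        for w, n in doc.items():
            total += n * math.log(sum(p_w_z[z][w] * p_z_d[d][z] for z in range(K)))
    return total


def fold_in_score(test_counts, p_w_z, steps=20, seed=0):
    """Held-out score by folding in: P(w|z) stays fixed, P(z|q) is re-fitted."""
    rng = random.Random(seed)
    K = len(p_w_z)
    p_z_q = [normalize([rng.random() + 0.1 for _ in range(K)]) for _ in test_counts]
    for _ in range(steps):
        _, p_z_q = em_step(test_counts, p_w_z, p_z_q, update_words=False)
    return log_likelihood(test_counts, p_w_z, p_z_q)


def train_plsa(train, test, K, V, max_steps=30, seed=0):
    rng = random.Random(seed)
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(V)]) for _ in range(K)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(K)]) for _ in train]
    history = []  # (step, training log-likelihood, held-out fold-in score)
    for step in range(1, max_steps + 1):
        p_w_z, p_z_d = em_step(train, p_w_z, p_z_d)
        history.append((step,
                        log_likelihood(train, p_w_z, p_z_d),
                        fold_in_score(test, p_w_z)))
    # Pick the step count whose model scored best on the held-out documents.
    best_step = max(history, key=lambda h: h[2])[0]
    return history, best_step


# Toy corpus (made up): two documents about words 0-2, two about words 3-5.
train = [{0: 4, 1: 3, 2: 1}, {0: 2, 1: 5}, {3: 4, 4: 3, 5: 2}, {3: 1, 4: 4, 5: 3}]
test = [{0: 2, 1: 2, 2: 1}, {3: 3, 5: 2}]
history, best_step = train_plsa(train, test, K=2, V=6)
print("chosen number of EM steps:", best_step)
```

Because folding-in fits part of the model to the test documents themselves, the resulting value is a substitute for a test-data likelihood rather than a true one, which is the kind of limitation the abstract refers to.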