Topic model validation

Authors:
Eduardo H. Ramirez;Ramon Brena;Davide Magatti;Fabio Stella
Affiliations:
Tecnologico de Monterrey, Campus Monterrey, Monterrey, Mexico;Tecnologico de Monterrey, Campus Monterrey, Monterrey, Mexico;DISCo, Università degli Studi di Milano-Bicocca, Viale Sarca 336, 20126 Milano, Italy;DISCo, Università degli Studi di Milano-Bicocca, Viale Sarca 336, 20126 Milano, Italy
Venue:
Neurocomputing
Year:
2012

Abstract

This paper considers the problem of externally validating the semantic coherence of topic models. The Fowlkes-Mallows index, a known clustering validation metric, is generalized to overlapping partitions and multi-labeled collections, which makes it suitable for validating topic modeling algorithms. In addition, we propose new probabilistic metrics inspired by the concepts of recall and precision. The proposed metrics have clear probabilistic interpretations and can also be applied to validate and compare other soft and overlapping clustering algorithms. The approach is exemplified by validating LDA models on the Reuters-21578 multi-labeled collection, and Monte Carlo simulations are then used to show convergence to the correct results. Additional statistical evidence is provided to clarify how the presented metrics relate to one another.
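
To make the idea of a pairwise index over overlapping partitions concrete, here is a minimal Python sketch. It is not the paper's definition, which the abstract does not state. It assumes a simple convention: two documents count as "co-labeled" if they share at least one label, and as "co-clustered" if they share at least one cluster (for example, a topic assigned to both). Precision and recall are then computed over document pairs, and the Fowlkes-Mallows value is taken as their geometric mean, as in the classical index. The function name `pairwise_scores` and the toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pairwise_scores(labels, clusters):
    """Pairwise precision, recall and Fowlkes-Mallows value for overlapping assignments.

    labels[i]   -- set of ground-truth labels of document i (multi-label allowed)
    clusters[i] -- set of clusters/topics assigned to document i (overlap allowed)

    Assumption (not taken from the paper): a pair of documents "agrees" on a
    partition when the two sets share at least one element.
    """
    if len(labels) != len(clusters):
        raise ValueError("labels and clusters must describe the same documents")

    both = same_label = same_cluster = 0
    for i, j in combinations(range(len(labels)), 2):
        co_labeled = bool(labels[i] & labels[j])
        co_clustered = bool(clusters[i] & clusters[j])
        same_label += co_labeled
        same_cluster += co_clustered
        both += co_labeled and co_clustered

    # Precision: fraction of co-clustered pairs that are also co-labeled.
    precision = both / same_cluster if same_cluster else 0.0
    # Recall: fraction of co-labeled pairs that are also co-clustered.
    recall = both / same_label if same_label else 0.0
    return precision, recall, sqrt(precision * recall)


# Toy example: four documents, three labels, two overlapping clusters.
toy_labels = [{"a"}, {"a", "b"}, {"b"}, {"c"}]
toy_clusters = [{0}, {0, 1}, {1}, {1}]
precision, recall, fm = pairwise_scores(toy_labels, toy_clusters)
```

Enumerating all pairs costs quadratic time in the number of documents. For a larger collection, the same quantities could instead be estimated from a random sample of document pairs, which is in the spirit of the Monte Carlo simulations the abstract mentions.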