In this paper, we propose the Fully Sparse Topic Model (FSTM) for modeling large collections of documents. The model has three key properties: (1) the inference algorithm converges in linear time, (2) learning of topics reduces to a multiplication of two sparse matrices, and (3) it provides a principled way to directly trade off the sparsity of solutions against inference quality and running time. These properties enable us to learn sparse topics quickly, to infer sparse latent representations of documents, and to save a significant amount of memory for storage. We show that inference in FSTM is in fact MAP inference with an implicit prior. Extensive experiments show that FSTM can perform substantially better than various existing topic models under different performance measures. Finally, our parallel implementation can handily learn thousands of topics from large corpora with millions of terms.
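To make the described trade-off concrete, the sketch below shows one possible greedy (Frank-Wolfe-style) inference procedure over the topic simplex that has the stated properties: each iteration costs time linear in the number of distinct terms of the document, and the iteration budget directly bounds how many topics receive nonzero weight. This is a minimal illustration under those assumptions, not the authors' implementation; the names infer_doc, beta, doc_counts, and n_iters are invented for the example.

import numpy as np

def infer_doc(doc_counts, beta, n_iters=50):
    # doc_counts : (V,) term counts of one document (sparse in practice)
    # beta       : (K, V) topic-word distributions, rows sum to 1
    # n_iters    : iteration budget; the returned theta has at most
    #              n_iters + 1 nonzero entries, trading sparsity for quality
    K = beta.shape[0]
    nz = doc_counts > 0              # only terms occurring in the document matter
    d = doc_counts[nz]
    B = beta[:, nz]                  # (K, |doc|) restriction of the topic matrix

    # start at the single best vertex (topic) of the simplex
    k0 = np.argmax(d @ np.log(B.T + 1e-12))
    theta = np.zeros(K)
    theta[k0] = 1.0

    for t in range(n_iters):
        p = theta @ B                               # predicted term distribution
        grad = B @ (d / np.maximum(p, 1e-12))       # gradient of the log-likelihood
        k = np.argmax(grad)                         # vertex most aligned with the gradient
        alpha = 2.0 / (t + 3.0)                     # standard diminishing step size
        theta = (1.0 - alpha) * theta               # move toward the chosen vertex
        theta[k] += alpha
    return theta                                    # sparse topic proportions

Because each document's theta is sparse, the inferred representations of a corpus can be stacked into a sparse document-topic matrix, which is consistent with the abstract's point that topic learning then amounts to multiplying two sparse matrices.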