In this paper, we propose the Fully Sparse Topic Model (FSTM) for modeling large collections of documents. The model has three key properties: (1) the inference algorithm converges in linear time, (2) learning of topics reduces to a multiplication of two sparse matrices, and (3) it provides a principled way to directly trade off the sparsity of solutions against inference quality and running time. These properties enable us to learn sparse topics quickly, to infer sparse latent representations of documents, and to save a significant amount of memory for storage. We show that inference in FSTM is in fact MAP inference with an implicit prior. Extensive experiments show that FSTM can perform substantially better than various existing topic models under different performance measures. Finally, our parallel implementation can handily learn thousands of topics from large corpora with millions of terms.
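To make the described trade-off concrete, the sketch below shows one possible greedy (Frank-Wolfe-style) inference procedure over the topic simplex that has the stated properties: each iteration costs time linear in the number of distinct terms of the document, and the iteration budget directly bounds how many topics receive nonzero weight. This is a minimal illustration under those assumptions, not the authors' implementation; the names infer_doc, beta, doc_counts, and n_iters are invented for the example.

import numpy as np

def infer_doc(doc_counts, beta, n_iters=50):
    # doc_counts : (V,) term counts of one document (sparse in practice)
    # beta       : (K, V) topic-word distributions, rows sum to 1
    # n_iters    : iteration budget; the returned theta has at most
    #              n_iters + 1 nonzero entries, trading sparsity for quality
    K = beta.shape[0]
    nz = doc_counts > 0              # only terms occurring in the document matter
    d = doc_counts[nz]
    B = beta[:, nz]                  # (K, |doc|) restriction of the topic matrix

    # start at the single best vertex (topic) of the simplex
    k0 = np.argmax(d @ np.log(B.T + 1e-12))
    theta = np.zeros(K)
    theta[k0] = 1.0

    for t in range(n_iters):
        p = theta @ B                               # predicted term distribution
        grad = B @ (d / np.maximum(p, 1e-12))       # gradient of the log-likelihood
        k = np.argmax(grad)                         # vertex most aligned with the gradient
        alpha = 2.0 / (t + 3.0)                     # standard diminishing step size
        theta = (1.0 - alpha) * theta               # move toward the chosen vertex
        theta[k] += alpha
    return theta                                    # sparse topic proportions

Because each document's theta is sparse, the inferred representations of a corpus can be stacked into a sparse document-topic matrix, which is consistent with the abstract's point that topic learning then amounts to multiplying two sparse matrices.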