Larger residuals, less work: active document scheduling for latent Dirichlet allocation

Authors:
Mirwaes Wahabzada;Kristian Kersting
Affiliations:
Knowledge Discovery Department, Fraunhofer IAIS, Sankt Augustin, Germany;Knowledge Discovery Department, Fraunhofer IAIS, Sankt Augustin, Germany
Venue:
ECML PKDD'11 Proceedings of the 2011 European conference on Machine learning and knowledge discovery in databases - Volume Part III
Year:
2011

Abstract

Recently, there have been considerable advances in fast inference for latent Dirichlet allocation (LDA). In particular, stochastic optimization of the variational Bayes (VB) objective function with a natural gradient step has been proved to converge and can process massive document collections. To reduce noise in the gradient estimate, it uses multiple documents chosen uniformly at random. Although it is widely recognized that the scheduling of documents in stochastic optimization may have significant consequences, this issue remains largely unexplored, and we address it in this work. Specifically, we propose residual LDA, a novel, easy-to-implement LDA approach that schedules documents in an informed way. Intuitively, in each iteration, residual LDA actively selects documents that exert a disproportionately large influence on the current residual to compute the next update. On several real-world datasets, including 3M articles from Wikipedia, we demonstrate that residual LDA can handily analyze massive document collections and find topic models as good as or better than those found with batch VB and randomly scheduled VB, and does so significantly faster.
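
The scheduling idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define the residual, so the sketch assumes a hypothetical per-document residual (the L1 change in that document's variational parameters since its last visit) and draws each mini-batch with probability proportional to those residuals rather than uniformly. The names `doc_residual` and `sample_minibatch` are illustrative only.

```python
import numpy as np


def doc_residual(old_params, new_params):
    """Hypothetical residual: L1 change in a document's variational parameters."""
    old = np.asarray(old_params, dtype=float)
    new = np.asarray(new_params, dtype=float)
    return float(np.abs(new - old).sum())


def sample_minibatch(residuals, batch_size, rng, eps=1e-12):
    """Draw distinct document indices with probability proportional to residual.

    eps keeps every document selectable, so documents with a zero residual
    are still revisited occasionally (an assumption of this sketch).
    """
    r = np.asarray(residuals, dtype=float) + eps
    probs = r / r.sum()
    return rng.choice(len(r), size=batch_size, replace=False, p=probs)


rng = np.random.default_rng(0)

# Toy residuals for six documents: documents 2 and 4 changed the most.
residuals = np.array([0.01, 0.01, 5.0, 0.01, 3.0, 0.01])
batch = sample_minibatch(residuals, batch_size=2, rng=rng)

# After a document is processed, its residual would be refreshed, for example:
residuals[2] = doc_residual([1.0, 2.0, 3.0], [1.1, 2.0, 2.9])
```

With uniform scheduling every document would be equally likely to enter the batch; here, documents whose parameters are still changing are visited more often, which is the intuition the abstract gives for doing less work overall.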