More influence means less work: fast latent Dirichlet allocation by influence scheduling

Authors:
Mirwaes Wahabzada;Kristian Kersting;Anja Pilz;Christian Bauckhage
Affiliations:
Fraunhofer Institute for Intelligent Analysis and Information Systems, Sankt Augustin, Germany;Fraunhofer Institute for Intelligent Analysis and Information Systems, Sankt Augustin, Germany;Fraunhofer Institute for Intelligent Analysis and Information Systems, Sankt Augustin, Germany;Fraunhofer Institute for Intelligent Analysis and Information Systems, Sankt Augustin, Germany
Venue:
Proceedings of the 20th ACM international conference on Information and knowledge management
Year:
2011

Abstract

There have recently been considerable advances in fast inference for (online) latent Dirichlet allocation (LDA). While it is widely recognized that the scheduling of documents in stochastic optimization, and in turn in LDA, may have significant consequences, this issue remains largely unexplored. Instead, practitioners schedule documents essentially uniformly at random, perhaps because this is easy to implement and because clear guidelines on scheduling documents are lacking. In this work, we address this issue and propose to schedule documents that exert a disproportionately large influence on the topics of the corpus for an update before less influential ones. More precisely, we justify forming mini-batches by sampling documents at random with a bias toward those with higher norms. On several real-world datasets, including 3M articles from Wikipedia and 8M from PubMed, we demonstrate that the resulting influence-scheduled LDA can handily analyze massive document collections and find topic models as good as or better than those found with online LDA, often in a fraction of the time.
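
The norm-biased mini-batch sampling described in the abstract can be illustrated with a minimal sketch. The abstract does not say which norm is used or what the exact sampling distribution is, so the choices here are assumptions made for illustration only: documents are sparse term-count dictionaries, the influence score is the squared Euclidean norm of the count vector, and indices are drawn with replacement. The function names `document_norms` and `sample_minibatch` are hypothetical and are not taken from the paper.

```python
import math
import random


def document_norms(docs):
    """Euclidean norm of each document's term-count vector.

    Each document is a dict mapping term -> count (a sparse bag of words).
    """
    return [math.sqrt(sum(c * c for c in doc.values())) for doc in docs]


def sample_minibatch(docs, batch_size, rng=None):
    """Draw document indices with probability proportional to squared norm.

    Assumptions (not stated in the abstract): the squared Euclidean norm
    serves as the influence score, and sampling is with replacement, so an
    index may appear more than once in a batch.
    """
    rng = rng or random.Random()
    weights = [n * n for n in document_norms(docs)]
    if sum(weights) == 0:
        raise ValueError("all documents are empty")
    return rng.choices(range(len(docs)), weights=weights, k=batch_size)


# Toy corpus: one long document and two short ones.
docs = [
    {"topic": 9, "model": 7, "inference": 5},  # large norm
    {"topic": 1},                              # small norm
    {"model": 1, "corpus": 1},                 # small norm
]

batch = sample_minibatch(docs, batch_size=1000, rng=random.Random(0))
counts = [batch.count(i) for i in range(len(docs))]
```

In this toy corpus the first document carries almost all of the squared-norm mass, so it dominates the sampled batch. A uniform scheduler would instead pick each document about equally often.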