Group matrix factorization for scalable topic modeling

  • Authors:
  • Quan Wang; Zheng Cao; Jun Xu; Hang Li

  • Affiliations:
  • MOE-Microsoft Key Laboratory of Statistics & Information Technology, Peking University, Beijing, China; Dept. of Computer Science & Engineering, Shanghai Jiao Tong University, Shanghai, China; Microsoft Research Asia, Beijing, China; Microsoft Research Asia, Beijing, China

  • Venue:
  • SIGIR '12: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 2012

Abstract

Topic modeling can reveal the latent structure of text data and is useful for knowledge discovery, search relevance ranking, document classification, and so on. One of the major challenges in topic modeling is dealing with large datasets and large numbers of topics in real-world applications. In this paper, we investigate techniques for scaling up non-probabilistic topic modeling approaches such as Regularized Latent Semantic Indexing (RLSI) and Non-negative Matrix Factorization (NMF). We propose a general topic modeling method, referred to as Group Matrix Factorization (GMF), to enhance the scalability and efficiency of these non-probabilistic approaches. GMF assumes that the text documents have already been categorized into multiple semantic classes, and that there exist class-specific topics for each class as well as topics shared across all classes. Topic modeling is then formalized as the problem of minimizing a general objective function with regularizations and/or constraints on the class-specific topics and shared topics. In this way, the learning of class-specific topics can be conducted in parallel, greatly improving scalability and efficiency. We apply GMF to RLSI and NMF, obtaining Group RLSI (GRLSI) and Group NMF (GNMF), respectively. Experiments on a Wikipedia dataset and a real-world web dataset, each containing about 3 million documents, show that GRLSI and GNMF greatly improve the scalability and efficiency of RLSI and NMF. The topics discovered by GRLSI and GNMF are coherent and readable. Further experiments on a search relevance dataset containing 30,000 labeled queries show that using the topics learned by GRLSI and GNMF significantly improves search relevance.
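
The following is a minimal sketch, not the paper's algorithm, of the shared-plus-class-specific factorization structure the abstract describes. It assumes a simplified objective: each class's term-document matrix D_c is approximated by [U_shared, U_c] @ V_c under squared loss with L2 regularization, solved by alternating least squares. The actual GRLSI and GNMF formulations instead use sparsity (l1) and nonnegativity constraints and differ in detail; all names here (gmf_als, k_shared, k_class, lam) are illustrative.

```python
import numpy as np

def gmf_als(D_list, k_shared, k_class, lam=0.1, n_iters=20, seed=0):
    """Toy alternating-least-squares sketch of a group factorization:
    each class matrix D_c (terms x docs) is approximated by
    [U_shared, U_c] @ V_c, with L2 regularization on all factors.
    Illustration only; not the GRLSI/GNMF objective from the paper."""
    rng = np.random.default_rng(seed)
    m = D_list[0].shape[0]                      # vocabulary size
    U_s = rng.standard_normal((m, k_shared))    # shared topics
    U_c = [rng.standard_normal((m, k_class)) for _ in D_list]  # class-specific topics
    k = k_shared + k_class
    I_k = np.eye(k)

    for _ in range(n_iters):
        V = []
        # Per-class steps: each touches only one class's data,
        # so they can run in parallel across classes.
        for c, D in enumerate(D_list):
            U = np.hstack([U_s, U_c[c]])
            # ridge solution for the document representations V_c
            V_c = np.linalg.solve(U.T @ U + lam * I_k, U.T @ D)
            V.append(V_c)
            Vs, Vc = V_c[:k_shared], V_c[k_shared:]
            # update class-specific topics against the residual left by shared topics
            R = D - U_s @ Vs
            U_c[c] = np.linalg.solve(Vc @ Vc.T + lam * np.eye(k_class), Vc @ R.T).T
        # Aggregation step: shared topics collect statistics from all classes.
        A = lam * np.eye(k_shared)
        B = np.zeros((m, k_shared))
        for c, D in enumerate(D_list):
            Vs, Vc = V[c][:k_shared], V[c][k_shared:]
            A += Vs @ Vs.T
            B += (D - U_c[c] @ Vc) @ Vs.T
        U_s = np.linalg.solve(A, B.T).T
    return U_s, U_c, V

# Usage on synthetic data: 3 classes over a 500-term vocabulary.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    D_list = [np.abs(rng.standard_normal((500, n))) for n in (80, 120, 100)]
    U_s, U_c, V = gmf_als(D_list, k_shared=5, k_class=3)
    err = sum(np.linalg.norm(D - np.hstack([U_s, U]) @ v) ** 2
              for D, U, v in zip(D_list, U_c, V))
    print("reconstruction error:", round(err, 2))
```

The point of the sketch is the update structure: the document-representation and class-specific-topic updates depend only on a single class's matrix, which is what allows the per-class learning to be distributed, while only the comparatively small shared-topic update requires aggregating statistics across classes.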