Decomposing background topics from keywords by principal component pursuit

Authors:
Kerui Min; Zhengdong Zhang; John Wright; Yi Ma
Affiliations:
Fudan University, Shanghai, China; Tsinghua University, Beijing, China; Microsoft Research Asia, Beijing, China; Microsoft Research Asia, Beijing, China
Venue:
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Year:
2010

Citing 13

Using latent semantic analysis to improve access to textual information. CHI '88 Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
Latent semantic indexing: a probabilistic analysis. PODS '98 Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
Probabilistic latent semantic indexing. Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
A similarity-based probability model for latent semantic indexing. Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Unsupervised learning by probabilistic latent semantic analysis. Machine Learning
Information Retrieval from Documents: A Survey. Information Retrieval
Zipf and Heaps Laws' Coefficients Depend on Language. CICLing '01 Proceedings of the Second International Conference on Computational Linguistics and Intelligent Text Processing
Latent dirichlet allocation. The Journal of Machine Learning Research
LDA-based document models for ad-hoc retrieval. SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
An interdisciplinary perspective on information retrieval. Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Learning to Rank for Information Retrieval. Foundations and Trends in Information Retrieval
Supervised semantic indexing. Proceedings of the 18th ACM conference on Information and knowledge management
Learning to rank from Bayesian decision inference. Proceedings of the 18th ACM conference on Information and knowledge management

Cited 9

Query by document via a decomposition-based two-level retrieval approach. Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Emerging topic detection using dictionary learning. Proceedings of the 20th ACM international conference on Information and knowledge management
Structural analysis of network traffic matrix via relaxed principal component pursuit. Computer Networks: The International Journal of Computer and Telecommunications Networking
A fast tri-factorization method for low-rank matrix recovery and completion. Pattern Recognition
An efficient matrix factorization based low-rank representation for subspace clustering. Pattern Recognition
Active subspace: Toward scalable low-rank learning. Neural Computation
Sparkler: supporting large-scale matrix factorization. Proceedings of the 16th International Conference on Extending Database Technology
Entity linking at the tail: sparse signals, unknown entities, and phrase models. Proceedings of the 7th ACM international conference on Web search and data mining
An efficient matrix bi-factorization alternative optimization method for low-rank matrix recovery and completion. Neural Networks

Abstract

Low-dimensional topic models have proven very useful for modeling a large corpus of documents that share a relatively small number of topics. Dimensionality reduction tools such as Principal Component Analysis (PCA) or Latent Semantic Indexing (LSI) have been widely adopted for document modeling, analysis, and retrieval. In this paper, we contend that a more pertinent model for a document corpus is the combination of an (approximately) low-dimensional topic model for the corpus and a sparse model for the keywords of individual documents. For such a joint topic-document model, LSI or PCA is no longer appropriate for analyzing the corpus data. We hence introduce a powerful new tool called Principal Component Pursuit that can effectively decompose such corpus data into its low-dimensional and sparse components. We give empirical results on data synthesized with a Latent Dirichlet Allocation (LDA) model to validate the new model. We then show that for real document data analysis, the new tool significantly reduces the perplexity and improves retrieval performance compared to classical baselines.
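
The decomposition described in the abstract can be illustrated with a minimal sketch. It assumes the standard Principal Component Pursuit formulation, in which a term-document matrix M is split as M = L + S by minimizing the nuclear norm of L plus lam times the entrywise L1 norm of S. The solver below is a generic augmented-Lagrangian (ADMM-style) iteration, not necessarily the algorithm the authors used. The function names, the default lam = 1/sqrt(max(m, n)), the default step parameter mu, and the stopping tolerance are all assumptions made for illustration.

```python
import numpy as np


def soft_threshold(X, tau):
    """Entrywise shrinkage: move every entry toward zero by tau."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)


def singular_value_threshold(X, tau):
    """Apply soft thresholding to the singular values of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * soft_threshold(s, tau)) @ Vt


def principal_component_pursuit(M, lam=None, mu=None, tol=1e-7, max_iter=500):
    """Split M into a low-rank part L and a sparse part S with M ~ L + S.

    Illustrative sketch only. The defaults for lam and mu are commonly
    used heuristics and are assumptions here; M must not be all zeros.
    """
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))
    if mu is None:
        mu = m * n / (4.0 * np.abs(M).sum())

    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)  # Lagrange multiplier for the constraint L + S = M
    norm_M = np.linalg.norm(M)

    for _ in range(max_iter):
        # Low-rank update: shrink singular values.
        L = singular_value_threshold(M - S + Y / mu, 1.0 / mu)
        # Sparse update: shrink individual entries.
        S = soft_threshold(M - L + Y / mu, lam / mu)
        # Dual update on the remaining constraint violation.
        residual = M - L - S
        Y = Y + mu * residual
        if np.linalg.norm(residual) <= tol * norm_M:
            break
    return L, S
```

Under this reading, the columns of L would carry the shared background topics of the corpus, while the nonzero entries of S would mark keywords specific to individual documents. A plain truncated SVD, as in LSI or PCA, has no separate sparse term, so such entries would be absorbed into the low-rank fit or left in the residual.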