Using latent semantic analysis to improve access to textual information
CHI '88 Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
Latent semantic indexing: a probabilistic analysis
PODS '98 Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
Probabilistic latent semantic indexing
Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
A similarity-based probability model for latent semantic indexing
Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Unsupervised learning by probabilistic latent semantic analysis
Machine Learning
Information Retrieval from Documents: A Survey
Information Retrieval
Zipf and Heaps Laws' Coefficients Depend on Language
CICLing '01 Proceedings of the Second International Conference on Computational Linguistics and Intelligent Text Processing
The Journal of Machine Learning Research
LDA-based document models for ad-hoc retrieval
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
An interdisciplinary perspective on information retrieval
Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Learning to Rank for Information Retrieval
Foundations and Trends in Information Retrieval
Proceedings of the 18th ACM conference on Information and knowledge management
Learning to rank from Bayesian decision inference
Proceedings of the 18th ACM conference on Information and knowledge management
Query by document via a decomposition-based two-level retrieval approach
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Emerging topic detection using dictionary learning
Proceedings of the 20th ACM international conference on Information and knowledge management
Structural analysis of network traffic matrix via relaxed principal component pursuit
Computer Networks: The International Journal of Computer and Telecommunications Networking
A fast tri-factorization method for low-rank matrix recovery and completion
Pattern Recognition
Active subspace: Toward scalable low-rank learning
Neural Computation
Sparkler: supporting large-scale matrix factorization
Proceedings of the 16th International Conference on Extending Database Technology
Entity linking at the tail: sparse signals, unknown entities, and phrase models
Proceedings of the 7th ACM international conference on Web search and data mining
Hi-index | 0.00 |
Low-dimensional topic models have been proven very useful for modeling a large corpus of documents that share a relatively small number of topics. Dimensionality reduction tools such as Principal Component Analysis or Latent Semantic Indexing (LSI) have been widely adopted for document modeling, analysis, and retrieval. In this paper, we contend that a more pertinent model for a document corpus as the combination of an (approximately) low-dimensional topic model for the corpus and a sparse model for the keywords of individual documents. For such a joint topic-document model, LSI or PCA is no longer appropriate to analyze the corpus data. We hence introduce a powerful new tool called Principal Component Pursuit that can effectively decompose the low-dimensional and the sparse components of such corpus data. We give empirical results on data synthesized with a Latent Dirichlet Allocation (LDA) mode to validate the new model. We then show that for real document data analysis, the new tool significantly reduces the perplexity and improves retrieval performance compared to classical baselines.