Topic modeling can boost the performance of information retrieval, but its real-world application is limited by scalability issues. Scaling to larger document collections via parallelization is an active area of research, but most solutions require drastic steps such as vastly reducing the input vocabulary. We introduce Regularized Latent Semantic Indexing (RLSI), a new method designed for parallelization. It is as effective as existing topic models and scales to larger datasets without reducing the input vocabulary. RLSI formalizes topic modeling as the problem of minimizing a quadratic loss function regularized by the l₂ and/or l₁ norm. This formulation allows the learning process to be decomposed into multiple sub-optimization problems that can be solved in parallel, for example via MapReduce. In particular, we propose adopting the l₂ norm on topics and the l₁ norm on document representations, to create a model whose topics are compact and readable and which is useful for retrieval. Relevance ranking experiments on three TREC datasets show that RLSI performs better than LSI, PLSI, and LDA, and that the improvements are sometimes statistically significant. Experiments on a web dataset containing about 1.6 million documents and 7 million terms demonstrate a similar boost in performance on a larger corpus and vocabulary than in previous studies.
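The decomposition described above can be illustrated with a small single-machine sketch. It is not the authors' implementation. It assumes a term-document matrix D (terms by documents), a topic matrix U (terms by topics), and a representation matrix V (topics by documents), and minimizes the squared reconstruction error of D by UV plus an l₁ penalty on V and an l₂ penalty on U. All names and parameters (`fit`, `lam1`, `lam2`, `n_steps`) are hypothetical, and the l₁ sub-problems are solved with ISTA (proximal gradient), a solver chosen here for brevity rather than taken from the paper. The point is that with U fixed each document column of V is an independent problem, and with V fixed each term row of U is an independent ridge regression, so both steps could be distributed, for example as map tasks.

```python
import numpy as np


def soft_threshold(x, t):
    # Proximal operator of t * ||x||_1.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def update_doc(U, d, lam1, v0, n_steps=100):
    # Sub-problem for ONE document: min_v ||d - U v||^2 + lam1 * ||v||_1.
    # Solved with ISTA (an assumption of this sketch), warm-started at v0.
    lip = 2.0 * np.linalg.norm(U, 2) ** 2 + 1e-12  # Lipschitz constant of the gradient
    v = v0.copy()
    for _ in range(n_steps):
        grad = 2.0 * U.T @ (U @ v - d)
        v = soft_threshold(v - grad / lip, lam1 / lip)
    return v


def update_topics(D, V, lam2):
    # With V fixed, every row of U is an independent ridge regression.
    # Closed form for all rows at once: U = D V^T (V V^T + lam2 I)^{-1}.
    gram = V @ V.T + lam2 * np.eye(V.shape[0])
    return np.linalg.solve(gram, V @ D.T).T


def objective(D, U, V, lam1, lam2):
    return np.sum((D - U @ V) ** 2) + lam1 * np.abs(V).sum() + lam2 * np.sum(U ** 2)


def fit(D, K, lam1=0.1, lam2=0.1, n_iters=10, seed=0):
    rng = np.random.default_rng(seed)
    n_terms, n_docs = D.shape
    U = rng.random((n_terms, K))
    V = np.zeros((K, n_docs))
    history = []
    for _ in range(n_iters):
        # Each column update touches only one document: an independent task.
        V = np.column_stack(
            [update_doc(U, D[:, n], lam1, V[:, n]) for n in range(n_docs)]
        )
        U = update_topics(D, V, lam2)
        history.append(objective(D, U, V, lam1, lam2))
    return U, V, history


if __name__ == "__main__":
    D = np.random.default_rng(1).random((20, 15))  # toy data, not a real corpus
    U, V, history = fit(D, K=4)
    print(U.shape, V.shape, history[-1] <= history[0])
```

In this sketch the list comprehension over documents stands in for the parallel step; nothing in it is shared between documents except the fixed U. The topic update needs only the small K-by-K matrix V Vᵀ and the product D Vᵀ, both of which are sums over documents and could therefore be accumulated in a reduce step.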