Regularized latent semantic indexing

  • Authors: Quan Wang; Jun Xu; Hang Li; Nick Craswell
  • Affiliations: Peking University, Beijing, China; Microsoft Research Asia, Beijing, China; Microsoft Research Asia, Beijing, China; Microsoft, Bellevue, WA, USA
  • Venue: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval
  • Year: 2011

Abstract

Topic modeling can boost the performance of information retrieval, but its real-world application is limited by scalability issues. Scaling to larger document collections via parallelization is an active area of research, but most existing solutions require drastic steps such as vastly reducing the input vocabulary. We introduce Regularized Latent Semantic Indexing (RLSI), a new topic-modeling method designed for parallelization. It is as effective as existing topic models and scales to larger datasets without vocabulary reduction. RLSI formalizes topic modeling as the minimization of a quadratic loss function regularized by the l₁ and/or l₂ norm. This formulation allows the learning process to be decomposed into multiple sub-optimization problems that can be solved in parallel, for example via MapReduce. In particular, we propose applying the l₁ norm to topics and the l₂ norm to document representations, yielding a model with compact, readable topics that is also effective for retrieval. Relevance-ranking experiments on three TREC datasets show that RLSI outperforms LSI, PLSI, and LDA, with improvements that are sometimes statistically significant. Experiments on a web dataset of about 1.6 million documents and 7 million terms demonstrate a similar performance gain on a corpus and vocabulary much larger than those used in previous studies.
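
The abstract does not state the objective explicitly, but the formulation it describes can be sketched as follows; the notation D, U, V, λ₁, λ₂ is ours and not necessarily the paper's. Let D be the term-document matrix, U the term-topic matrix with topic columns u_k, and V the topic-document matrix with document columns v_n. A regularized quadratic objective of the kind described would be

$$\min_{U,\,V}\;\lVert D - UV \rVert_F^2 \;+\; \lambda_1 \sum_{k=1}^{K} \lVert u_k \rVert_1 \;+\; \lambda_2 \sum_{n=1}^{N} \lVert v_n \rVert_2^2 ,$$

where λ₁ and λ₂ are regularization weights. The decomposability claimed in the abstract follows because the loss separates over columns of D. With U fixed, each document representation comes from an independent ridge-regression sub-problem,

$$\min_{v_n}\;\lVert d_n - U v_n \rVert_2^2 + \lambda_2 \lVert v_n \rVert_2^2 \quad\Longrightarrow\quad v_n = \left(U^{\top} U + \lambda_2 I\right)^{-1} U^{\top} d_n ,$$

and with V fixed, the update of U likewise decomposes into independent lasso-style sub-problems (e.g., one per term row, since the l₁ penalty separates entrywise). These independent sub-problems are what can be distributed across machines, for example as MapReduce tasks.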