The aim of latent semantic indexing (LSI) is to uncover the relationships between terms, hidden concepts, and documents. LSI uses the matrix factorization technique known as singular value decomposition (SVD). In this paper, we apply LSI to standard benchmark collections. We find that LSI yields poor retrieval accuracy on the TREC 2, 7, 8, and 2004 collections. We believe this negative result is robust because we evaluate more LSI variants than any previous work. First, we show that using Okapi BM25 weights for terms in documents improves the performance of LSI. Second, we derive novel scoring methods that implement the ideas of query expansion and score regularization in the LSI framework. Third, we show how to combine the BM25 method with LSI methods. All proposed methods are evaluated experimentally on the four TREC collections mentioned above. The experiments show that the new variants of LSI improve upon previous LSI methods. Nevertheless, no way of using LSI achieves a worthwhile improvement in retrieval accuracy over BM25.