An empirical study of required dimensionality for large-scale latent semantic indexing applications

Authors: Roger B. Bradford
Affiliations: Agilex Technologies, Chantilly, VA, USA
Venue: Proceedings of the 17th ACM conference on Information and knowledge management
Year: 2008


Abstract

Latent semantic indexing (LSI) is used in a wide variety of commercial applications. In these applications, the processing time and RAM required for SVD computation, and the processing time and RAM required during LSI retrieval operations, are all roughly linear in the number of dimensions, k, chosen for the LSI representation space. In large-scale commercial LSI applications, reducing k could therefore significantly reduce server costs. This paper explores the effects of varying dimensionality. The approach focuses on term comparisons: pairs of terms with strong real-world associations are selected, and the proximities of the members of each pair in the LSI space are compared at multiple values of k. Testing is carried out on collections ranging from one to five million documents. For the five-million-document collection, a value of k ≈ 400 provides the best performance. The results suggest that there is something of an 'island of stability' in the k = 300 to 500 range. The results also indicate that there is relatively little room to employ k values outside of this range without incurring significant distortions in at least some term-term correlations.
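
The term-pair comparison described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the toy term-document matrix, the use of cosine similarity as the proximity measure, the scaling of term vectors by the singular values, and all function names (`cosine`, `pair_similarity_by_k`) are assumptions made for illustration. A dense NumPy SVD is used here, which would not scale to the million-document collections studied in the paper.

```python
import numpy as np

# Hypothetical toy corpus: rows are terms, columns are documents.
# "car" and "automobile" never co-occur, but both co-occur with "engine".
VOCAB = ["car", "automobile", "engine", "bank", "loan", "money"]
TERM_DOC = np.array([
    [1, 0, 0, 0, 0],  # car
    [0, 1, 0, 0, 0],  # automobile
    [1, 1, 0, 0, 0],  # engine
    [0, 0, 1, 0, 1],  # bank
    [0, 0, 1, 1, 0],  # loan
    [0, 0, 0, 1, 1],  # money
], dtype=float)


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def pair_similarity_by_k(term_doc, vocab, pairs, ks):
    """Compare term-pair proximity in the LSI space at several values of k.

    Term vectors are taken as rows of U_k scaled by the top-k singular
    values (an assumed convention; other scalings are possible).
    """
    index = {term: i for i, term in enumerate(vocab)}
    # One full SVD, then truncate to each k of interest.
    u, s, _ = np.linalg.svd(term_doc, full_matrices=False)
    results = {}
    for k in ks:
        vecs = u[:, :k] * s[:k]
        results[k] = [cosine(vecs[index[a]], vecs[index[b]]) for a, b in pairs]
    return results


if __name__ == "__main__":
    pairs = [("car", "automobile"), ("car", "bank")]
    for k, sims in pair_similarity_by_k(TERM_DOC, VOCAB, pairs, [2, 5]).items():
        for (a, b), sim in zip(pairs, sims):
            print(f"k={k}: cos({a}, {b}) = {sim:.3f}")
```

In this toy matrix, the associated pair ("car", "automobile") is close in the rank-2 space but nearly orthogonal at full rank (k = 5), because the two terms share no documents. This is the kind of dependence of term-term proximity on k that the study measures at much larger scale.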