Implementation techniques for large-scale latent semantic indexing applications

Authors: Roger B. Bradford
Affiliations: Agilex Technologies Inc., Chantilly, VA, USA
Venue: Proceedings of the 20th ACM international conference on Information and knowledge management
Year: 2011

Abstract

The technique of latent semantic indexing (LSI) has wide applicability in information retrieval and data mining tasks. To date, however, most applications of LSI have addressed relatively small collections of data. This has been due partly to hardware and software limitations and partly to overly pessimistic estimates of the processing requirements of the singular value decomposition (SVD) process. In recent years, advances in hardware capabilities and software implementations have enabled much larger LSI applications. Moreover, experience with large LSI indexes has shown that the SVD is not the limitation on scalability that it was long thought to be. This paper describes techniques applicable to creating large-scale (multi-million document) LSI indexes. Detailed data regarding the LSI index creation process are presented for collections of up to 100 million documents. Four key factors are shown to contribute to the scalability of LSI. First, in most situations, the time required to calculate the SVD of the term-document matrix is not the dominant factor determining the overall time required to build an LSI index. Second, the time required to calculate the SVD in LSI is linear in the number of objects indexed. Third, incremental index creation greatly facilitates use of LSI in dynamic environments. Fourth, distributed query processing can be employed to support large numbers of users. It is shown that LSI is well-suited for implementation in modern distributed computing environments. This paper provides the first measurements of the execution time for large-scale LSI build processes in a cloud environment.
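
The fourth factor, distributed query processing, can be illustrated with a minimal sketch. The abstract does not say how the paper partitions work, so the scheme below is an assumption: document vectors (already projected into the LSI space) are split across shards, each shard returns its own top-k matches by cosine similarity, and the partial results are merged. The names `search_shard` and `search_all`, the shard layout, and the toy two-dimensional vectors are all hypothetical.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_shard(shard, query_vec, k):
    """Return the k best (score, doc_id) pairs from one shard.

    `shard` maps doc_id -> LSI-space vector (hypothetical layout).
    """
    scored = ((cosine(vec, query_vec), doc_id) for doc_id, vec in shard.items())
    return heapq.nlargest(k, scored)


def search_all(shards, query_vec, k):
    """Query every shard, then merge the per-shard top-k lists.

    In a real deployment each shard would run on its own node;
    here the shards are searched one after another in a single process.
    """
    partials = []
    for shard in shards:
        partials.extend(search_shard(shard, query_vec, k))
    return heapq.nlargest(k, partials)


# Toy data: two shards holding two-dimensional document vectors.
shards = [
    {"d1": [1.0, 0.0], "d2": [0.0, 1.0]},
    {"d3": [0.9, 0.1], "d4": [-1.0, 0.0]},
]
results = search_all(shards, [1.0, 0.0], k=2)
top_ids = [doc_id for _, doc_id in results]
```

Because each shard only needs to return its own k best candidates, the merge step handles at most k times the number of shards entries, regardless of how many documents each shard holds.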