Implementation techniques for large-scale latent semantic indexing applications

  • Authors:
  • Roger B. Bradford

  • Affiliations:
  • Agilex Technologies Inc., Chantilly, VA, USA

  • Venue:
  • Proceedings of the 20th ACM international conference on Information and knowledge management
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

The technique of latent semantic indexing (LSI) has wide applicability in information retrieval and data mining tasks. To date, however, most applications of LSI have addressed relatively small collections of data. This has been due partly to hardware and software limitations and partly to overly pessimistic estimates of the processing requirements of the singular value decomposition (SVD) process. In recent years, advances in hardware capabilities and software implementations have enabled much larger LSI applications. Moreover, experience with large LSI indexes has shown that the SVD is not the limitation on scalability that it was long thought to be. This paper describes techniques applicable to creating large-scale (multi-million document) LSI indexes. Detailed data regarding the LSI index creation process is presented for collections of up to 100 million documents. Four key factors are shown to contribute to the scalability of LSI. First, in most situations, the time required for calculation of the singular value decomposition (SVD) of the term-document matrix is not the dominant factor determining the overall time required to build an LSI index. Second, the time required to calculate the SVD in LSI is linear in the number of objects indexed. Third, incremental index creation greatly facilitates use of LSI in dynamic environments. Fourth, distributed query processing can be employed to support large numbers of users. It is shown that LSI is well-suited for implementation in modern distributed computing environments. This paper provides the first measurements of the execution time for large-scale LSI build processes in a cloud environment.