Distributed, large-scale latent semantic analysis by index interpolation

  • Authors:
  • Sebastiano Vigna

  • Affiliations:
  • DSI, Università degli Studi di Milano, Italy

  • Venue:
  • Proceedings of the 3rd international conference on Scalable information systems
  • Year:
  • 2008

Quantified Score

Hi-index 0.00

Visualization

Abstract

Latent semantic analysis [12] is a well-known technique to extrapolate concepts from a set of documents; it discards noise by reducing the rank of (a variant of) the term/document matrix of a document collection by singular value decomposition. The latter is performed by solving an equivalent symmetric eigenvector problem on a related matrix. Scaling to large set of documents, however, is problematic because every vector-matrix multiplication required by iterative solvers requires a number of multiplications equal to twice the number of postings of the collection. We show how to combine standard search-engine algorithmic tools in such a way to compute (reasonably) quickly the cooccurrence matrix C of a large document collection, and solve directly the associated symmetric eigenvector problem. Albeit the size of C is quadratic in the number of terms, we can distribute its computation among any number of computational unit without increasing the overall number of multiplications. Moreover, our approach is advantageous when the document collection is large, because the number of terms over which latent semantic analysis has to be performed is inherently limited by the size of a language lexicon. We present experiments over a collection with 3.6 billions of postings---two orders of magnitudes larger than any published experiment in the literature.