Distributed, large-scale latent semantic analysis by index interpolation

Authors:
Sebastiano Vigna
Affiliations:
DSI, Università degli Studi di Milano, Italy
Venue:
Proceedings of the 3rd international conference on Scalable information systems
Year:
2008

Cites: 15
Cited by: 3

Term-weighting approaches in automatic text retrieval

Information Processing and Management: an International Journal
Matrix multiplication via arithmetic progressions

Journal of Symbolic Computation - Special issue on computational algebraic complexity
Implicit application of polynomial filters in a k-step Arnoldi method

SIAM Journal on Matrix Analysis and Applications
Rectangular matrix multiplication revisited

Journal of Complexity
The structural cause of file size distributions

Proceedings of the 2001 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Algorithms in C: Parts 1-4, Fundamentals, Data Structures, Sorting, and Searching

Algorithms in C: Parts 1-4, Fundamentals, Data Structures, Sorting, and Searching
Telcordia LSI Engine: Implementation Scalability and Issues

Eleventh International Workshop on Research Issues in Data Engineering on Document Management for Data Intensive Business and Scientific Applications
Simple BM25 extension to multiple weighted fields

Proceedings of the thirteenth ACM international conference on Information and knowledge management
Clustered SVD strategies in latent semantic indexing

Information Processing and Management: an International Journal
Why spectral retrieval works

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Handbook of Parallel Computing and Statistics (Statistics, Textbooks and Monographs)

Handbook of Parallel Computing and Statistics (Statistics, Textbooks and Monographs)
Type less, find more: fast autocompletion search with a succinct index

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Out-of-core SVD performance for document indexing

Applied Numerical Mathematics
Introduction to Information Retrieval

Introduction to Information Retrieval

ParaText: scalable text modeling and analysis

Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
Subspace tracking for latent semantic analysis

ECIR'11 Proceedings of the 33rd European conference on Advances in information retrieval
DLPR: a distributed locality preserving dimension reduction algorithm

IDCS'12 Proceedings of the 5th international conference on Internet and Distributed Computing Systems

Abstract

Latent semantic analysis [12] is a well-known technique for extracting concepts from a set of documents; it discards noise by reducing the rank of (a variant of) the term/document matrix of a document collection by singular value decomposition. The decomposition is performed by solving an equivalent symmetric eigenvector problem on a related matrix. Scaling to large sets of documents, however, is problematic, because each vector-matrix multiplication performed by an iterative solver costs a number of scalar multiplications equal to twice the number of postings in the collection. We show how to combine standard search-engine algorithmic tools so as to compute (reasonably) quickly the cooccurrence matrix C of a large document collection, and to solve the associated symmetric eigenvector problem directly. Although the size of C is quadratic in the number of terms, its computation can be distributed among any number of computational units without increasing the overall number of multiplications. Moreover, our approach is advantageous when the document collection is large, because the number of terms over which latent semantic analysis has to be performed is inherently limited by the size of a language lexicon. We present experiments over a collection with 3.6 billion postings, two orders of magnitude larger than any published experiment in the literature.
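
As a rough illustration of the idea in the abstract, the following Python sketch builds the cooccurrence matrix C of a tiny collection and solves the symmetric eigenvector problem on it directly. It is not the paper's method: the abstract does not describe the index-interpolation algorithm, so the sketch assumes a small dense term/document matrix A (terms as rows, documents as columns), takes C = A Aᵀ, and assumes that work is split among units by blocks of documents. The function names (partial_cooccurrence, cooccurrence, top_concepts) and the use of NumPy are hypothetical choices made for this example only.

```python
import numpy as np


def partial_cooccurrence(A_block):
    """Contribution of one block of documents (columns) to C = A A^T."""
    return A_block @ A_block.T


def cooccurrence(A, num_units):
    """Sum the per-block contributions, as if each block ran on its own unit.

    Assumption: work is partitioned by documents. Each document's column is
    processed by exactly one unit, so the total number of multiplications
    does not depend on num_units.
    """
    num_terms = A.shape[0]
    C = np.zeros((num_terms, num_terms))
    for block in np.array_split(A, num_units, axis=1):
        C += partial_cooccurrence(block)
    return C


def top_concepts(C, k):
    """Return the k largest eigenvalues of the symmetric matrix C and their eigenvectors."""
    eigenvalues, eigenvectors = np.linalg.eigh(C)  # eigh returns ascending order
    order = np.argsort(eigenvalues)[::-1][:k]
    return eigenvalues[order], eigenvectors[:, order]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.random((6, 20))          # 6 terms, 20 documents (toy data)
    C = cooccurrence(A, num_units=4)
    values, vectors = top_concepts(C, k=2)
    # The square roots of the eigenvalues of C are the singular values of A,
    # and the eigenvectors are the corresponding left singular vectors.
    print(np.sqrt(values))
```

In this toy setting C has one row and one column per term, so its size depends on the lexicon rather than on the number of documents, which is the property the abstract relies on for large collections.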