Term-weighting approaches in automatic text retrieval
Information Processing and Management: an International Journal
Matrix multiplication via arithmetic progressions
Journal of Symbolic Computation - Special issue on computational algebraic complexity
Implicit application of polynomial filters in a k-step Arnoldi method
SIAM Journal on Matrix Analysis and Applications
Rectangular matrix multiplication revisited
Journal of Complexity
The structural cause of file size distributions
Proceedings of the 2001 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Algorithms in C: Parts 1-4, Fundamentals, Data Structures, Sorting, and Searching
Algorithms in C: Parts 1-4, Fundamentals, Data Structures, Sorting, and Searching
Telcordia LSI Engine: Implementation Scalability and Issues
Eleventh International Workshop on Research Issues in Data Engineering on Document Management for Data Intensive Business and Scientific Applications
Simple BM25 extension to multiple weighted fields
Proceedings of the thirteenth ACM international conference on Information and knowledge management
Clustered SVD strategies in latent semantic indexing
Information Processing and Management: an International Journal
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Handbook of Parallel Computing and Statistics (Statistics, Textbooks and Monographs)
Handbook of Parallel Computing and Statistics (Statistics, Textbooks and Monographs)
Type less, find more: fast autocompletion search with a succinct index
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Out-of-core SVD performance for document indexing
Applied Numerical Mathematics
Introduction to Information Retrieval
Introduction to Information Retrieval
ParaText: scalable text modeling and analysis
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
Subspace tracking for latent semantic analysis
ECIR'11 Proceedings of the 33rd European conference on Advances in information retrieval
DLPR: a distributed locality preserving dimension reduction algorithm
IDCS'12 Proceedings of the 5th international conference on Internet and Distributed Computing Systems
Latent semantic analysis [12] is a well-known technique for extrapolating concepts from a set of documents; it discards noise by reducing the rank of (a variant of) the term/document matrix of a document collection by singular value decomposition. The latter is performed by solving an equivalent symmetric eigenvector problem on a related matrix. Scaling to large sets of documents, however, is problematic because each vector-matrix multiplication performed by an iterative solver requires a number of multiplications equal to twice the number of postings of the collection. We show how to combine standard search-engine algorithmic tools so as to compute (reasonably) quickly the cooccurrence matrix C of a large document collection, and to solve the associated symmetric eigenvector problem directly. Although the size of C is quadratic in the number of terms, we can distribute its computation among any number of computational units without increasing the overall number of multiplications. Moreover, our approach is advantageous when the document collection is large, because the number of terms over which latent semantic analysis has to be performed is inherently limited by the size of a language lexicon. We present experiments over a collection with 3.6 billion postings, two orders of magnitude larger than any published experiment in the literature.
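The relationship the abstract relies on can be illustrated with a small numerical sketch. It is not the authors' implementation. It assumes a dense toy term/document matrix `A` (terms by documents), takes the cooccurrence matrix to be `C = A Aᵀ`, and uses NumPy's dense symmetric eigensolver in place of whatever solver the paper uses. The helper names `cooccurrence_by_blocks` and `top_k_from_cooccurrence` are hypothetical. The sketch shows that C can be accumulated as a sum of independent per-block products, so the work can be split across units without extra multiplications, and that the square roots of the leading eigenvalues of C match the leading singular values of A.

```python
import numpy as np


def cooccurrence_by_blocks(A, block_size):
    """Accumulate C = A @ A.T as a sum over blocks of documents (columns).

    Each block's product is independent of the others, so in a distributed
    setting every block could be handled by a different unit and the partial
    results summed at the end. (Hypothetical helper, for illustration only.)
    """
    n_terms, n_docs = A.shape
    C = np.zeros((n_terms, n_terms))
    for start in range(0, n_docs, block_size):
        block = A[:, start:start + block_size]
        C += block @ block.T
    return C


def top_k_from_cooccurrence(C, k):
    """Return the k largest singular values of A and the matching term-space
    vectors, obtained from the symmetric eigenproblem on C."""
    eigvals, eigvecs = np.linalg.eigh(C)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]       # indices of the k largest
    singular_values = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return singular_values, eigvecs[:, order]


# Toy data: 6 terms, 40 documents, roughly 30% nonzero entries (assumed values).
rng = np.random.default_rng(0)
A = rng.random((6, 40)) * (rng.random((6, 40)) < 0.3)

C = cooccurrence_by_blocks(A, block_size=8)
singular_values, U = top_k_from_cooccurrence(C, k=3)

# Reference values from a direct SVD of A, for comparison.
reference = np.linalg.svd(A, compute_uv=False)[:3]
print(np.allclose(C, A @ A.T), np.allclose(singular_values, reference))
```

In this sketch C has one row and one column per term, so its size depends on the vocabulary and not on the number of documents, which is the property the abstract points to for large collections. A real system would use sparse postings and a scalable eigensolver; both are outside the scope of this illustration.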