Two decades after its inception, Latent Semantic Analysis (LSA) has become part and parcel of every modern introduction to information retrieval (IR). For any tool that matures so quickly, it is important to check its lore and limitations, or else stagnation will set in. We focus here on the three main aspects of LSA that are widely accepted, the gist of which can be summarized as follows: (1) that LSA recovers latent semantic factors underlying the document space; (2) that this can be accomplished through lossy compression of the document space that eliminates lexical noise; and (3) that such compression is best achieved by Singular Value Decomposition (SVD). For each aspect we performed experiments analogous to those reported in the LSA literature and compared the evidence brought to bear in each case. On the negative side, we show that the above claims about LSA are far more limited than commonly believed. A simple example suffices to show that LSA does not recover the optimal semantic factors, not even in the pedagogical example used in many LSA publications. Moreover, and in marked contrast with LSA lore, LSA does not scale up well: the larger the document space, the less likely it is that LSA recovers an optimal set of semantic factors. On the positive side, we describe new algorithms that replace LSA (and more recent alternatives such as pLSA, LDA, and kernel methods) by trading its ℓ2 space for an ℓ1 space, thereby guaranteeing an optimal set of semantic factors. These algorithms seem to salvage the spirit of LSA as we believe it was initially conceived.
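Since the abstract turns on the contrast between ℓ2-optimal compression (SVD) and ℓ1-based recovery, a small sketch may help fix ideas. The following is a minimal illustration of classic LSA, not code from the paper: a toy term-document matrix (invented here for illustration) is compressed by truncated SVD, which is optimal in the ℓ2 (Frobenius) sense, and documents are then compared in the resulting latent space.

```python
# Minimal sketch of classic LSA via truncated SVD, using only NumPy.
# The toy term-document matrix below is illustrative, not from the paper.
import numpy as np

# Rows = terms, columns = documents (raw term counts).
A = np.array([
    [1, 1, 0, 0],   # "ship"
    [0, 1, 1, 0],   # "boat"
    [0, 0, 1, 1],   # "ocean"
    [1, 0, 0, 1],   # "voyage"
], dtype=float)

# Full SVD: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keeping the k largest singular values gives the rank-k matrix A_k
# closest to A in the l2 (Frobenius) sense; this lossy compression
# is the step LSA relies on to "eliminate lexical noise".
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Documents are compared in the k-dimensional latent space given by
# the columns of diag(s_k) @ Vt_k.
doc_vecs = (np.diag(s[:k]) @ Vt[:k, :]).T

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print("rank-%d reconstruction error (Frobenius): %.4f"
      % (k, np.linalg.norm(A - A_k)))
print("cos(doc0, doc1) in latent space: %.4f" % cosine(doc_vecs[0], doc_vecs[1]))
```

By contrast, the ℓ1 trade the abstract alludes to can be illustrated with basis pursuit (minimize ||x||_1 subject to Dx = b), solved as a linear program in the spirit of compressed-sensing decoding. The dictionary D, the sparse factor vector, and the problem sizes below are toy assumptions; the paper's actual algorithms may differ.

```python
# Hedged sketch of l1 recovery via basis pursuit as a linear program.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n = 8, 16                       # underdetermined system: 8 equations, 16 unknowns
D = rng.standard_normal((m, n))    # toy "semantic dictionary"
x_true = np.zeros(n)
x_true[[3, 11]] = [1.5, -2.0]      # a sparse set of "factors"
b = D @ x_true

# Split x = x_pos - x_neg with x_pos, x_neg >= 0; minimizing
# sum(x_pos + x_neg) then minimizes ||x||_1 subject to D x = b.
c = np.ones(2 * n)
A_eq = np.hstack([D, -D])
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * n))
x_hat = res.x[:n] - res.x[n:]

# With enough random measurements, l1 minimization typically recovers
# the sparse x exactly, whereas an l2 (least-norm) solution spreads
# energy over all coordinates.
print("recovered support:", np.flatnonzero(np.abs(x_hat) > 1e-6))
```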