Variable latent semantic indexing

  • Authors:
  • Anirban Dasgupta;Ravi Kumar;Prabhakar Raghavan;Andrew Tomkins

  • Affiliations:
  • Cornell University, Ithaca, NY;IBM Almaden Research Center, San Jose, CA;Yahoo! Research Labs, Sunnyvale, CA;IBM Almaden Research Center, San Jose, CA

  • Venue:
  • Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
  • Year:
  • 2005

Abstract

Latent Semantic Indexing is a classical method to produce optimal low-rank approximations of a term-document matrix. However, in the context of a particular query distribution, the approximation thus produced need not be optimal. We propose VLSI, a new query-dependent (or "variable") low-rank approximation that minimizes approximation error for any specified query distribution. With this tool, it is possible to tailor the LSI technique to particular settings, often resulting in vastly improved approximations at much lower dimensionality. We validate this method via a series of experiments on classical corpora, showing that VLSI typically performs similarly to LSI with an order of magnitude fewer dimensions.
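To make the contrast in the abstract concrete, the sketch below compares plain LSI (a truncated SVD of the term-document matrix) with one standard query-weighted low-rank approximation. It is an illustration only and not necessarily the authors' exact formulation: it assumes the query distribution is summarized by a term co-occurrence matrix C = E[q qᵀ] and that the error to minimize is the expected squared error of q·A over queries q. The function names, the random matrix A, and the synthetic queries are all hypothetical.

```python
import numpy as np

def truncated_svd(M, k):
    """Best rank-k approximation of M in Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def lsi_approx(A, k):
    """Plain LSI: rank-k truncated SVD of the term-document matrix A."""
    return truncated_svd(A, k)

def query_weighted_approx(A, C, k):
    """Rank-k X minimizing trace((A - X)^T C (A - X)).

    Assumption (not taken from the abstract): C = E[q q^T] is a symmetric
    positive semidefinite terms-by-terms matrix describing the queries.
    The objective equals ||C^(1/2) (A - X)||_F^2, so we truncate the SVD of
    C^(1/2) A and map back with the pseudo-inverse of C^(1/2).
    """
    w, V = np.linalg.eigh(C)
    w = np.clip(w, 0.0, None)
    keep = w > 1e-12 * max(w.max(), 1e-300)   # drop numerically zero directions
    sqrt_w = np.sqrt(w[keep])
    Vk = V[:, keep]
    C_half = (Vk * sqrt_w) @ Vk.T
    C_half_pinv = (Vk / sqrt_w) @ Vk.T
    return C_half_pinv @ truncated_svd(C_half @ A, k)

def expected_query_error(A, X, C):
    """E_q ||q^T (A - X)||^2 for queries with second-moment matrix C."""
    R = A - X
    return float(np.trace(R.T @ C @ R))

# Hypothetical data: 30 terms, 20 documents, queries that use only 5 terms.
rng = np.random.default_rng(0)
A = rng.random((30, 20))
Q = np.zeros((200, 30))
Q[:, :5] = rng.random((200, 5))
C = Q.T @ Q / Q.shape[0]

k = 3
X_lsi = lsi_approx(A, k)
X_vlsi = query_weighted_approx(A, C, k)
err_lsi = expected_query_error(A, X_lsi, C)
err_vlsi = expected_query_error(A, X_vlsi, C)
print(f"LSI error: {err_lsi:.4f}, query-weighted error: {err_vlsi:.4f}")
```

Under this objective the query-weighted approximation can never do worse than LSI at the same rank, because LSI's output is one of the candidate rank-k matrices the weighted problem optimizes over. In this sketch, rows of the approximation for terms that never occur in any query are set to zero, since they do not affect the expected error.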