Modeling information scent: a comparison of LSA, PMI and GLSA similarity measures on common tests and corpora

  • Authors:
  • Raluca Budiu; Christiaan Royer; Peter Pirolli

  • Affiliations:
  • Palo Alto Research Center, Palo Alto, CA; Palo Alto Research Center, Palo Alto, CA; Palo Alto Research Center, Palo Alto, CA

  • Venue:
  • Large Scale Semantic Access to Content (Text, Image, Video, and Sound)
  • Year:
  • 2007

Abstract

This paper compares three systems that estimate semantic similarity between words: Latent Semantic Analysis (LSA; Landauer & Dumais, 1997), Pointwise Mutual Information (PMI; Turney, 2001), and Generalized Latent Semantic Analysis (GLSA; Matveeva, Levow, Farahat, & Royer, 2005). We compare all three techniques on a single common corpus (TASA) and, for PMI and GLSA, we also report performance on a larger web-based corpus. The evaluation uses two kinds of tests: (1) synonymy tests, and (2) comparisons with human word-similarity judgments. The results indicate that, for large corpora, PMI works best on word-similarity tests and GLSA works best on synonymy tests. For the smaller TASA corpus, GLSA produced the best performance on most tests. The larger corpus improved the performance of PMI but, in most cases, did not improve that of GLSA.
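To make the synonymy test concrete, here is a minimal sketch of how a PMI-style measure could answer a single multiple-choice synonym item. It is an illustration only, not the setup used in the paper: the toy corpus, the choice of document-level co-occurrence as the counting unit, and the function names `count_cooccurrences`, `pmi`, and `answer_synonym_item` are all assumptions introduced here.

```python
import math
from collections import Counter
from itertools import combinations


def count_cooccurrences(documents):
    """Count, per word and per unordered word pair, how many documents contain it.

    Assumption: two words co-occur when they appear in the same document.
    """
    word_df = Counter()
    pair_df = Counter()
    for doc in documents:
        words = set(doc.lower().split())
        word_df.update(words)
        pair_df.update(frozenset(p) for p in combinations(sorted(words), 2))
    return word_df, pair_df, len(documents)


def pmi(w1, w2, word_df, pair_df, n_docs):
    """Pointwise mutual information: log2( P(w1, w2) / (P(w1) * P(w2)) )."""
    joint = pair_df[frozenset((w1, w2))]
    if joint == 0:
        # The pair never co-occurs, so treat it as minimally similar.
        return float("-inf")
    return math.log2(joint * n_docs / (word_df[w1] * word_df[w2]))


def answer_synonym_item(target, choices, word_df, pair_df, n_docs):
    """Pick the candidate with the highest PMI to the target word."""
    return max(choices, key=lambda c: pmi(target, c, word_df, pair_df, n_docs))


# Hypothetical toy corpus; the paper itself uses TASA and a larger web-based corpus.
documents = [
    "the big dog barked",
    "a large dog barked",
    "the big large house",
    "a small cat slept",
    "the small house",
]
word_df, pair_df, n_docs = count_cooccurrences(documents)
best = answer_synonym_item("big", ["large", "small", "cat"], word_df, pair_df, n_docs)
print(best)
```

A word-similarity test would use the same `pmi` scores differently: rather than picking a single best candidate, the scores for a list of word pairs would be correlated with human ratings of those pairs.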