Modeling information scent: a comparison of LSA, PMI and GLSA similarity measures on common tests and corpora

Authors:
Raluca Budiu;Christiaan Royer;Peter Pirolli
Affiliations:
Palo Alto Research Center, Palo Alto, CA;Palo Alto Research Center, Palo Alto, CA;Palo Alto Research Center, Palo Alto, CA
Venue:
Large Scale Semantic Access to Content (Text, Image, Video, and Sound)
Year:
2007

Citing 14
Cited 4

WordNet: a lexical database for English

Communications of the ACM
Information visualization

Readings in information visualization
Foundations of statistical natural language processing

Foundations of statistical natural language processing
The effect of information scent on searching information: visualizations of large tree structures

AVI '00 Proceedings of the working conference on Advanced visual interfaces
Contextual correlates of synonymy

Communications of the ACM
Placing search in context: the concept revisited

ACM Transactions on Information Systems (TOIS)
The Bloodhound project: automating discovery of web usability issues using the InfoScent™ simulator

Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL

ECML '01 Proceedings of the 12th European Conference on Machine Learning
Co-occurrence vectors from corpora vs. distance vectors from dictionaries

COLING '94 Proceedings of the 15th conference on Computational linguistics - Volume 1
Tool for accurately predicting website navigation problems, non-problems, problem severity, and effectiveness of repairs

Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
A comparison of LSA, WordNet and PMI-IR for predicting user click behavior

Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
Frequency estimates for statistical word similarity measures

NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
Navigation in degree of interest trees

Proceedings of the working conference on Advanced visual interfaces
Using information content to evaluate semantic similarity in a taxonomy

IJCAI'95 Proceedings of the 14th international joint conference on Artificial intelligence - Volume 1


Abstract

This paper compares three systems that estimate semantic similarity between words: Latent Semantic Analysis (LSA; Landauer & Dumais, 1997), Pointwise Mutual Information (PMI; Turney, 2001), and Generalized Latent Semantic Analysis (GLSA; Matveeva, Levow, Farahat, & Royer, 2005). We compare all three techniques on a single common corpus (TASA) and, for PMI and GLSA, also report performance on a larger web-based corpus. The evaluation uses two kinds of tests: (1) synonymy tests and (2) comparison with human word similarity judgments. The results indicate that with the larger corpus PMI works best on word similarity tests and GLSA on synonymy tests. On the smaller TASA corpus, GLSA produced the best performance on most tests. The larger corpus improved the performance of PMI but, in most cases, did not improve that of GLSA.
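To make the PMI measure and the synonymy test concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not say how co-occurrence was counted, so the sketch assumes document-level co-occurrence, PMI(w1, w2) = log2(P(w1, w2) / (P(w1) P(w2))), and a made-up toy corpus. The names `pmi_table` and `answer_synonym_item` are hypothetical helpers introduced only for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(documents):
    """Return a PMI function estimated from document-level co-occurrence.

    Assumption: two words co-occur if they appear in the same document.
    Real systems may use fixed windows or search-engine hit counts instead.
    """
    n_docs = len(documents)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing both words of a pair
    for doc in documents:
        vocab = sorted(set(doc.lower().split()))
        word_df.update(vocab)
        # Pairs are stored in alphabetical order, so lookups must sort too.
        pair_df.update(combinations(vocab, 2))

    def pmi(w1, w2):
        a, b = sorted((w1, w2))
        joint = pair_df[(a, b)]
        if joint == 0:
            # Never observed together (this also covers w1 == w2 here).
            return float("-inf")
        # log2( P(a, b) / (P(a) * P(b)) ) with probabilities as doc fractions
        return math.log2(joint * n_docs / (word_df[a] * word_df[b]))

    return pmi


def answer_synonym_item(pmi, target, candidates):
    """Synonymy-test item: pick the candidate most similar to the target."""
    return max(candidates, key=lambda c: pmi(target, c))


# Toy corpus, invented for illustration only (not TASA or web data).
toy_corpus = [
    "the big dog barked",
    "a large dog slept",
    "big and large houses",
    "the small cat slept",
    "big large trucks",
]
pmi = pmi_table(toy_corpus)
choice = answer_synonym_item(pmi, "big", ["large", "small", "cat"])
print(choice)
```

On this toy data the sketch selects "large" for the target "big", because it is the only candidate that ever shares a document with the target. A multiple-choice synonymy test would be scored as the fraction of items answered this way that match the answer key, while the comparison with human judgments would instead correlate the similarity scores with human ratings.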