Similarity-Based Models of Word Cooccurrence Probabilities

  • Authors:
  • Ido Dagan; Lillian Lee; Fernando C. N. Pereira

  • Affiliations:
  • Ido Dagan: Dept. of Mathematics and Computer Science, Bar Ilan University, Ramat Gan 52900, Israel. dagan@macs.biu.ac.il
  • Lillian Lee: Department of Computer Science, Cornell University, Ithaca, NY 14853, USA. llee@cs.cornell.edu
  • Fernando C. N. Pereira: AT&T Labs—Research, 180 Park Ave., Florham Park, NJ 07932, USA. pereira@research.att.com

  • Venue:
  • Machine Learning - Special issue on natural language learning
  • Year:
  • 1999


Abstract

In many applications of natural language processing (NLP) it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations “eat a peach” and “eat a beach” is more likely. Statistical NLP methods determine the likelihood of a word combination from its frequency in a training corpus. However, the nature of language is such that many word combinations are infrequent and do not occur in any given corpus. In this work we propose a method for estimating the probability of such previously unseen word combinations using available information on “most similar” words. We describe probabilistic word association models based on distributional word similarity, and apply them to two tasks, language modeling and pseudo-word disambiguation.

In the language modeling task, a similarity-based model is used to improve probability estimates for unseen bigrams in a back-off language model. The similarity-based method yields a 20% perplexity improvement in the prediction of unseen bigrams and statistically significant reductions in speech-recognition error. We also compare four similarity-based estimation methods against back-off and maximum-likelihood estimation methods on a pseudo-word sense disambiguation task in which we controlled for both unigram and bigram frequency, to avoid giving too much weight to easy-to-disambiguate high-frequency configurations. The similarity-based methods perform up to 40% better on this particular task.
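The core idea of the abstract — estimating the probability of an unseen bigram as a similarity-weighted average over the probabilities of words most similar to the conditioning word — can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual model: the toy corpus, the Jaccard similarity over observed right-contexts, and all function names are assumptions introduced here (the paper itself uses distributional similarity measures over cooccurrence distributions, combined with a back-off scheme).

```python
from collections import Counter, defaultdict

def conditional_probs(bigrams):
    """Maximum-likelihood P(w2 | w1) from observed bigram counts."""
    counts = defaultdict(Counter)
    for w1, w2 in bigrams:
        counts[w1][w2] += 1
    return {w1: {w2: n / sum(c.values()) for w2, n in c.items()}
            for w1, c in counts.items()}

def jaccard_sim(d1, d2):
    """Illustrative similarity: overlap between the sets of words
    observed to the right of each conditioning word."""
    s1, s2 = set(d1), set(d2)
    return len(s1 & s2) / len(s1 | s2) if (s1 | s2) else 0.0

def similarity_estimate(w1, w2, probs, k=3):
    """Estimate P(w2 | w1) as a similarity-weighted average of
    P(w2 | w1') over the k words w1' most similar to w1."""
    sims = sorted(((jaccard_sim(probs.get(w1, {}), d), w)
                   for w, d in probs.items() if w != w1),
                  reverse=True)[:k]
    norm = sum(s for s, _ in sims)
    if norm == 0.0:
        return 0.0
    return sum(s * probs[w].get(w2, 0.0) for s, w in sims) / norm

# Toy corpus (hypothetical): the bigram ("eat", "peach") is unseen,
# but "devour" behaves like "eat", so its statistics transfer.
bigrams = [("eat", "apple"), ("eat", "bread"),
           ("devour", "apple"), ("devour", "peach"),
           ("swim", "beach")]
probs = conditional_probs(bigrams)
```

Under this sketch, `similarity_estimate("eat", "peach", probs)` is positive because "devour" shares a right-context with "eat", while `similarity_estimate("eat", "beach", probs)` stays at zero — mirroring the peach/beach example from the abstract.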