Using the web to obtain frequencies for unseen bigrams

  • Authors: Frank Keller; Mirella Lapata

  • Affiliations: School of Informatics, University of Edinburgh, 2 Buccleuch Place, Edinburgh EH8 9LW, UK; Department of Computer Science, University of Sheffield, 211 Portobello Street, Sheffield S1 4DP, UK

  • Venue: Computational Linguistics - Special issue on web as corpus

  • Year: 2003

Abstract

This article shows that the Web can be employed to obtain frequencies for bigrams that are unseen in a given corpus. We describe a method for retrieving counts for adjective-noun, noun-noun, and verb-object bigrams from the Web by querying a search engine. We evaluate this method by demonstrating: (a) a high correlation between Web frequencies and corpus frequencies; (b) a reliable correlation between Web frequencies and plausibility judgments; (c) a reliable correlation between Web frequencies and frequencies recreated using class-based smoothing; (d) a good performance of Web frequencies in a pseudodisambiguation task.
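Two of the evaluations named in the abstract, the correlation between Web and corpus frequencies and the pseudodisambiguation task, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the article: the function names, the use of Pearson's r on log-transformed counts, the add-one offset that keeps zero counts defined under the logarithm, and the tie handling. The counts are expected to come from whatever search-engine interface is available; no querying code is shown.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def log_count_correlation(web_counts, corpus_counts):
    """Correlate log-transformed counts over the bigrams present in both maps.

    Both arguments map a bigram, e.g. ("hungry", "animal"), to a raw count.
    Assumption: add 1 before taking log10 so that zero counts stay defined.
    """
    shared = sorted(set(web_counts) & set(corpus_counts))
    web_logs = [math.log10(web_counts[b] + 1) for b in shared]
    corpus_logs = [math.log10(corpus_counts[b] + 1) for b in shared]
    return pearson(web_logs, corpus_logs)


def pseudo_disambiguate(head, candidate_a, candidate_b, web_counts):
    """Pick the candidate that forms the more frequent bigram with `head`.

    Returns None on a tie (assumption: ties are left undecided).
    Bigrams missing from `web_counts` are treated as having count 0.
    """
    count_a = web_counts.get((head, candidate_a), 0)
    count_b = web_counts.get((head, candidate_b), 0)
    if count_a == count_b:
        return None
    return candidate_a if count_a > count_b else candidate_b
```

Calling `log_count_correlation(web, corpus)` with two dictionaries of counts gives a single coefficient for the bigrams they share, and `pseudo_disambiguate("eat", "apple", "theory", web)` returns whichever object has the higher count with the verb in `web`.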