Web-based and combined language models: a case study on noun compound identification

  • Authors: Carlos Ramisch; Aline Villavicencio; Christian Boitet
  • Affiliations: University of Grenoble and Federal University of Rio Grande do Sul; Federal University of Rio Grande do Sul; University of Grenoble
  • Venue: COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters
  • Year: 2010

Abstract

This paper looks at the web as a corpus and at the effects of using web counts to model language, particularly when we consider them as a domain-specific versus a general-purpose resource. We first compare three vocabularies that were ranked according to frequencies drawn from general-purpose, specialised and web corpora. Then, we look at methods to combine heterogeneous corpora and evaluate the individual and combined counts in the automatic extraction of noun compounds from English general-purpose and specialised texts. Better n-gram counts can help improve the performance of empirical NLP systems that rely on n-gram language models.
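
The abstract does not say how the heterogeneous counts are combined, so the following is only a minimal sketch of one common option, not the authors' method: normalise each corpus's raw n-gram counts to relative frequencies (so corpora of very different sizes become comparable) and take a weighted sum. The function names, the weights and the toy counts are all hypothetical and chosen purely for illustration.

```python
from collections import Counter


def relative_freqs(counts):
    """Turn raw n-gram counts into relative frequencies for one corpus."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ngram: c / total for ngram, c in counts.items()}


def combine_counts(corpora, weights):
    """Weighted sum of per-corpus relative frequencies (linear interpolation).

    corpora: dict mapping corpus name -> dict of n-gram -> raw count
    weights: dict mapping corpus name -> interpolation weight (assumed to sum to 1)
    """
    combined = Counter()
    for name, counts in corpora.items():
        for ngram, p in relative_freqs(counts).items():
            combined[ngram] += weights[name] * p
    return dict(combined)


# Toy counts, invented for illustration only.
corpora = {
    "general": {"language model": 2, "noun compound": 2},
    "specialised": {"noun compound": 3, "gene expression": 1},
}
weights = {"general": 0.5, "specialised": 0.5}

combined = combine_counts(corpora, weights)
for ngram, score in sorted(combined.items(), key=lambda kv: -kv[1]):
    print(f"{ngram}\t{score:.3f}")
```

Normalising before mixing keeps a very large source, such as web counts, from swamping a small specialised corpus; other schemes (for example back-off from one corpus to another) would be equally plausible readings of "combined counts".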