Introduction to the special issue on the web as corpus
Computational Linguistics - Special issue on web as corpus
Using the web to obtain frequencies for unseen bigrams
Computational Linguistics - Special issue on web as corpus
MARSYAS: a framework for audio analysis
Organised Sound
Web-based models for natural language processing
ACM Transactions on Speech and Language Processing (TSLP)
Computational Linguistics
The GENIA corpus: an annotated research abstract corpus in molecular biology domain
HLT '02 Proceedings of the second international conference on Human Language Technology Research
Using the web as an implicit training set: application to noun compound syntax and semantics
Interpretation of compound nominalisations using corpus and web statistics
MWE '06 Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties
Using small random samples for the manual evaluation of statistical association measures
Computer Speech and Language
Web-scale N-gram models for lexical disambiguation
IJCAI '09 Proceedings of the 21st international joint conference on Artificial intelligence
Multiword expressions in the wild?: the mwetoolkit comes in handy
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations
Detecting noun compounds and light verb constructions: a contrastive study
MWE '11 Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World
Fast and flexible MWE candidate generation with the mwetoolkit
MWE '11 Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World
This paper examines the web as a corpus and the effects of using web counts to model language, particularly when the web is treated as a domain-specific versus a general-purpose resource. We first compare three vocabularies ranked according to frequencies drawn from general-purpose, specialised and web corpora. We then investigate methods for combining counts from heterogeneous corpora and evaluate the individual and combined counts on the automatic extraction of noun compounds from English general-purpose and specialised texts. Better n-gram counts can improve the performance of empirical NLP systems that rely on n-gram language models.
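The abstract does not specify how counts from heterogeneous corpora are combined. A minimal sketch of one plausible approach, linearly interpolating relative frequencies across corpora, is shown below; the corpus names, toy counts, and interpolation weights are all illustrative assumptions, not the paper's actual method or data.

```python
# Hypothetical sketch: combining bigram counts from heterogeneous corpora
# by weighted interpolation of relative frequencies. All names, counts,
# and weights below are illustrative, not taken from the paper.

def relative_freq(counts, ngram):
    """Relative frequency of an n-gram within one corpus's count table."""
    total = sum(counts.values())
    return counts.get(ngram, 0) / total if total else 0.0

def combined_score(ngram, corpora, weights):
    """Weighted interpolation of relative frequencies across corpora."""
    return sum(w * relative_freq(c, ngram) for c, w in zip(corpora, weights))

# Toy counts standing in for general-purpose, specialised, and web corpora.
general = {("noun", "compound"): 3, ("web", "corpus"): 5}
special = {("noun", "compound"): 8}
web     = {("noun", "compound"): 40, ("web", "corpus"): 100}

score = combined_score(("noun", "compound"),
                       [general, special, web],
                       weights=[0.3, 0.3, 0.4])
```

A combined score of this kind can then rank noun compound candidates, with the weights tuned so that, for example, the larger but noisier web counts do not dominate the specialised corpus evidence.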