Introduction to the special issue on the web as corpus
Computational Linguistics - Special issue on web as corpus
Using the web to obtain frequencies for unseen bigrams
Computational Linguistics - Special issue on web as corpus
MARSYAS: a framework for audio analysis
Organised Sound
Web-based models for natural language processing
ACM Transactions on Speech and Language Processing (TSLP)
Computational Linguistics
The GENIA corpus: an annotated research abstract corpus in molecular biology domain
HLT '02 Proceedings of the second international conference on Human Language Technology Research
Using the web as an implicit training set: application to noun compound syntax and semantics
Interpretation of compound nominalisations using corpus and web statistics
MWE '06 Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties
Using small random samples for the manual evaluation of statistical association measures
Computer Speech and Language
Web-scale N-gram models for lexical disambiguation
IJCAI '09 Proceedings of the 21st international joint conference on Artificial intelligence
Multiword expressions in the wild?: the mwetoolkit comes in handy
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations
Detecting noun compounds and light verb constructions: a contrastive study
MWE '11 Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World
Fast and flexible MWE candidate generation with the mwetoolkit
MWE '11 Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World
This paper examines the web as a corpus and the effects of using web counts to model language, particularly when the web is treated as a domain-specific versus a general-purpose resource. We first compare three vocabularies ranked according to frequencies drawn from general-purpose, specialised and web corpora. We then investigate methods for combining counts from heterogeneous corpora and evaluate the individual and combined counts on the automatic extraction of noun compounds from English general-purpose and specialised texts. Better n-gram counts can improve the performance of empirical NLP systems that rely on n-gram language models.
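The abstract does not specify how counts from heterogeneous corpora are combined. A minimal sketch of one plausible approach, linearly interpolating relative frequencies across corpora, is shown below; the corpus names, toy counts, and interpolation weights are all illustrative assumptions, not the paper's actual method or data.

```python
# Hypothetical sketch: combining bigram counts from heterogeneous corpora
# by weighted interpolation of relative frequencies. All names, counts,
# and weights below are illustrative, not taken from the paper.

def relative_freq(counts, ngram):
    """Relative frequency of an n-gram within one corpus's count table."""
    total = sum(counts.values())
    return counts.get(ngram, 0) / total if total else 0.0

def combined_score(ngram, corpora, weights):
    """Weighted interpolation of relative frequencies across corpora."""
    return sum(w * relative_freq(c, ngram) for c, w in zip(corpora, weights))

# Toy counts standing in for general-purpose, specialised, and web corpora.
general = {("noun", "compound"): 3, ("web", "corpus"): 5}
special = {("noun", "compound"): 8}
web     = {("noun", "compound"): 40, ("web", "corpus"): 100}

score = combined_score(("noun", "compound"),
                       [general, special, web],
                       weights=[0.3, 0.3, 0.4])
```

A combined score of this kind can then rank noun compound candidates, with the weights tuned so that, for example, the larger but noisier web counts do not dominate the specialised corpus evidence.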