A model of lexical attraction and repulsion. In ACL '98: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics.
An empirical study of smoothing techniques for language modeling. In ACL '96: Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics.
Scaling to very very large corpora for natural language disambiguation. In ACL '01: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics.
Using the web to overcome data sparseness. In EMNLP '02: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Volume 10.
Language independent NER using a maximum entropy tagger. In CoNLL '03: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4.
Banko and Brill (2001) suggested that developing very large training corpora may do more for progress in empirical Natural Language Processing than improving the methods that use existing, smaller corpora. This work tests their claim by exploring whether a very large corpus can eliminate the sparseness problems associated with estimating unigram probabilities. We do so by empirically investigating the convergence behaviour of unigram probability estimates on a one-billion-word corpus. With one billion words, as expected, many of our estimates do converge to their eventual values. However, for some words no such convergence occurs. This leads us to conclude that simply relying on large corpora is not in itself sufficient: we must pay attention to the statistical modelling as well.
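The convergence check described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's actual experimental code: it recomputes the maximum-likelihood unigram estimate count(w)/N at a set of corpus-size checkpoints, so one can see whether the estimate for a given word settles as the corpus grows. The function name and checkpoint scheme are assumptions for the sketch.

```python
from collections import Counter

def unigram_convergence(tokens, word, checkpoints):
    """Track the MLE unigram estimate P(word) = count(word)/N over
    growing prefixes of a token stream.

    Returns one estimate per checkpoint (a corpus size in tokens).
    For frequent, evenly distributed words the estimates settle;
    for bursty or rare words they may keep drifting, which is the
    non-convergence behaviour the abstract reports.
    """
    counts = Counter()
    estimates = []
    n = 0
    remaining = iter(sorted(checkpoints))
    next_cp = next(remaining, None)
    for tok in tokens:
        counts[tok] += 1
        n += 1
        if n == next_cp:
            estimates.append(counts[word] / n)
            next_cp = next(remaining, None)
    return estimates

# Toy usage: estimates of P("a") after 2 and after 4 tokens.
print(unigram_convergence(["a", "b", "a", "a"], "a", [2, 4]))
```

On a real corpus one would log-space the checkpoints (e.g. 1M, 10M, 100M, 1B tokens) and compare successive estimates against a relative-change threshold to decide whether a word's estimate has converged.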