Using large monolingual and bilingual corpora to improve coordination disambiguation

Authors:
Shane Bergsma;David Yarowsky;Kenneth Church
Affiliations:
Johns Hopkins University;Johns Hopkins University;Johns Hopkins University
Venue:
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Year:
2011

Abstract

Resolving coordination ambiguity is a classic hard problem. This paper looks at coordination disambiguation in complex noun phrases (NPs). Parsers trained on the Penn Treebank are reporting impressive numbers these days, but they don't do very well on this problem (79%). We explore systems trained using three types of corpora: (1) annotated (e.g., the Penn Treebank), (2) bitexts (e.g., Europarl), and (3) unannotated monolingual (e.g., Google N-grams). Size matters: (1) is a million words, (2) is potentially billions of words, and (3) is potentially trillions of words. The unannotated monolingual data is helpful when the ambiguity can be resolved through associations among the lexical items. The bilingual data is helpful when the ambiguity can be resolved by the order of words in the translation. We train separate classifiers with monolingual and bilingual features and iteratively improve them via co-training. The co-trained classifier achieves close to 96% accuracy on Treebank data and makes 20% fewer errors than a supervised system trained with Treebank annotations.
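
The co-training idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the nearest-centroid classifier, the one-dimensional feature values, and the names `train_centroids`, `score`, and `co_train` are all assumptions made for illustration. View A stands in for a monolingual association feature and view B for a bilingual word-order feature; each round, the classifier for each view labels the unlabeled examples it is most confident about, and those labels are added to the shared training set.

```python
# Toy co-training loop over two feature views (illustrative assumptions only).

def train_centroids(xs, ys):
    """Nearest-centroid model: mean feature vector for label 0 and label 1."""
    dims = len(xs[0])
    sums = {0: [0.0] * dims, 1: [0.0] * dims}
    counts = {0: 0, 1: 0}
    for x, y in zip(xs, ys):
        counts[y] += 1
        for i, v in enumerate(x):
            sums[y][i] += v
    # The seed set is assumed to contain both labels; max() only avoids
    # division by zero if it does not.
    return {y: [s / max(counts[y], 1) for s in sums[y]] for y in (0, 1)}

def score(model, x):
    """Positive means closer to the label-1 centroid; magnitude is confidence."""
    d0 = sum((a - b) ** 2 for a, b in zip(x, model[0]))
    d1 = sum((a - b) ** 2 for a, b in zip(x, model[1]))
    return d0 - d1

def co_train(view_a, view_b, seed_labels, rounds=5, per_round=2):
    """Grow the labeled set using each view's most confident predictions."""
    labels = dict(seed_labels)
    for _ in range(rounds):
        unlabeled = [i for i in range(len(view_a)) if i not in labels]
        if not unlabeled:
            break
        idx = sorted(labels)
        ys = [labels[i] for i in idx]
        model_a = train_centroids([view_a[i] for i in idx], ys)
        model_b = train_centroids([view_b[i] for i in idx], ys)
        for model, view in ((model_a, view_a), (model_b, view_b)):
            ranked = sorted(unlabeled, key=lambda i: -abs(score(model, view[i])))
            for i in ranked[:per_round]:
                if i not in labels:  # keep the first label an example receives
                    labels[i] = 1 if score(model, view[i]) > 0 else 0
    return labels

# Hypothetical data: eight ambiguous NPs, two bracketings (labels 0 and 1).
view_a = [[1.0], [-1.0], [0.9], [-0.8], [0.2], [-0.1], [0.7], [-0.9]]
view_b = [[0.8], [-0.9], [0.1], [-0.2], [0.9], [-1.0], [0.6], [-0.7]]
gold = [1, 0, 1, 0, 1, 0, 1, 0]

final = co_train(view_a, view_b, seed_labels={0: 1, 1: 0})
predicted = [final[i] for i in range(len(gold))]
print(predicted == gold)
```

In this toy data, examples 4 and 5 have weak values in view A but strong values in view B, so the view-B classifier labels them first. That is the kind of complementarity between monolingual and bilingual evidence the abstract describes, shown here only on made-up numbers.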