Foundations of statistical natural language processing
The Journal of Machine Learning Research
The domain dependence of parsing
ANLC '97 Proceedings of the fifth conference on Applied natural language processing
More accurate tests for the statistical significance of result differences
COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 2
Sample Selection for Statistical Parsing
Computational Linguistics
Non-projective dependency parsing using spanning tree algorithms
HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
Effective self-training for parsing
HLT-NAACL '06 Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics
CoNLL-X shared task on multilingual dependency parsing
CoNLL-X '06 Proceedings of the Tenth Conference on Computational Natural Language Learning
Domain adaptation with structural correspondence learning
EMNLP '06 Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing
Automatic prediction of parser accuracy
EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Genre distinctions for discourse in the Penn TreeBank
ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2
Automatic domain adaptation for parsing
HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Ensemble models for dependency parsing: cheap and good?
HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Fine-grained genre classification using structural learning algorithms
ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
Intelligent selection of language model training data
ACLShort '10 Proceedings of the ACL 2010 Conference Short Papers
Grammar-driven versus data-driven: which parsing system is more affected by domain shifts?
NLPLING '10 Proceedings of the 2010 Workshop on NLP and Linguistics: Finding the Common Ground
Using domain similarity for performance estimation
DANLP 2010 Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing
Uptraining for accurate deterministic question parsing
EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
Exploring variations across biomedical subdomains
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Adapting a probabilistic disambiguation model of an HPSG parser to a new domain
IJCNLP'05 Proceedings of the Second international joint conference on Natural Language Processing
Sentence-level instance-weighting for graph-based and transition-based dependency parsing
IWPT '11 Proceedings of the 12th International Conference on Parsing Technologies
Biographies or blenders: which resource is best for cross-domain sentiment analysis?
CICLing'12 Proceedings of the 13th international conference on Computational Linguistics and Intelligent Text Processing - Volume Part I
It is well known that parsing accuracy suffers when a model is applied to out-of-domain data. It is also known that the most beneficial training data for parsing a given domain is data that matches that domain (Sekine, 1997; Gildea, 2001). Selecting appropriate training data is therefore an important task. However, most previous work on domain adaptation has relied on the implicit assumption that domains are somehow given. As more and more data becomes available, automatic ways to select data that is beneficial for a new (unknown) target domain are becoming attractive. This paper evaluates various ways to automatically acquire related training data for a given test set. The results show that an unsupervised technique based on topic models is effective: it outperforms random data selection on both languages examined, English and Dutch. Moreover, the technique works better than manually assigned labels gathered from meta-data that is available for English.
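The abstract does not spell out the selection procedure, but the general idea of similarity-based training-data selection can be sketched. The following is a minimal illustration only, not the paper's topic-model method: it ranks candidate training documents by unigram cosine similarity to the target test set and keeps the closest ones (all function names are hypothetical).

```python
from collections import Counter
import math

def term_distribution(text):
    # Normalized unigram distribution over lowercase tokens
    # (a crude stand-in for a learned topic distribution).
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def cosine_similarity(p, q):
    # Cosine similarity between two sparse distributions.
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def select_training_data(target_text, candidates, k):
    # Rank candidate training documents by similarity to the
    # target domain and keep the k closest ones.
    target = term_distribution(target_text)
    ranked = sorted(
        candidates,
        key=lambda doc: cosine_similarity(term_distribution(doc), target),
        reverse=True,
    )
    return ranked[:k]
```

In the paper's setting, the term distributions would be replaced by topic distributions inferred by an unsupervised topic model, and the selected subset would serve as in-domain training data for the parser.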