Using domain similarity for performance estimation

Authors:
Vincent Van Asch; Walter Daelemans
Affiliations:
University of Antwerp, Antwerp, Belgium; University of Antwerp, Antwerp, Belgium
Venue:
DANLP 2010 Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing
Year:
2010

Abstract

Many natural language processing (NLP) tools exhibit a decrease in performance when they are applied to data that is linguistically different from the corpus used during development. This makes it hard to develop NLP tools for domains for which annotated corpora are not available. This paper explores a number of metrics that attempt to predict the cross-domain performance of an NLP tool through statistical inference. We apply different similarity metrics to compare different domains and investigate the correlation between similarity and the accuracy loss of an NLP tool. We find that the relationship between the performance of the tool and the similarity metric is linear and that the latter can therefore be used to predict the performance of an NLP tool on out-of-domain data. The approach also provides a way to quantify the difference between domains.
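
The idea in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: domains are compared through smoothed unigram distributions, Kullback-Leibler divergence stands in for the similarity metric (the abstract does not name the metrics used), and the function names and all numbers below are hypothetical and made up for illustration. A straight line is fitted to (divergence, accuracy) pairs from domains where annotations exist, and the line is then used to estimate accuracy on a new domain.

```python
import math
from collections import Counter


def unigram_distribution(tokens, vocabulary, smoothing=1.0):
    """Add-k smoothed relative frequencies over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + smoothing * len(vocabulary)
    return {w: (counts[w] + smoothing) / total for w in vocabulary}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q); p and q share the same keys."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def domain_divergence(train_tokens, test_tokens):
    """Divergence of the test domain from the training domain.

    KL divergence is an assumption of this sketch, chosen for brevity;
    any other similarity metric could be substituted here.
    """
    vocabulary = set(train_tokens) | set(test_tokens)
    p = unigram_distribution(test_tokens, vocabulary)
    q = unigram_distribution(train_tokens, vocabulary)
    return kl_divergence(p, q)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_accuracy(slope, intercept, divergence):
    """Estimate accuracy on an unseen domain from its divergence."""
    return slope * divergence + intercept


# Illustrative, made-up measurements (NOT results from the paper):
# divergence of each annotated test domain from the training corpus,
# paired with the accuracy the tool reached on that domain.
known_divergences = [0.0, 0.1, 0.2, 0.3]
known_accuracies = [0.96, 0.94, 0.92, 0.90]

slope, intercept = fit_line(known_divergences, known_accuracies)

# Divergence of a new, unannotated domain, computed from raw tokens only.
train = "the cat sat on the mat".split()
new_domain = "the protein binds the receptor".split()
estimate = predict_accuracy(slope, intercept, domain_divergence(train, new_domain))
print(f"estimated out-of-domain accuracy: {estimate:.3f}")
```

Only the fitting step needs annotated data; the divergence for the new domain is computed from unlabeled text, which is what would make such an estimate useful when no annotated corpus is available. A linear fit is used because the abstract reports a linear relationship, but the quality of any such estimate depends on the chosen metric and on how many annotated domains are available for fitting.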