Domain adaptation of natural language processing systems

  • Authors:
  • Fernando Pereira; John Blitzer

  • Affiliations:
  • University of Pennsylvania; University of Pennsylvania

  • Venue:
  • Ph.D. thesis, University of Pennsylvania
  • Year:
  • 2008


Abstract

Statistical language processing models are being applied to an ever wider and more varied range of linguistic domains. Collecting and curating training sets for each different domain is prohibitively expensive, and at the same time differences in vocabulary and writing style across domains can cause the error of state-of-the-art supervised models to increase dramatically. The first part of this thesis describes structural correspondence learning (SCL), a method for adapting linear discriminative models from resource-rich source domains to resource-poor target domains. The key idea is the use of pivot features, which occur frequently and behave similarly in both the source and target domains. SCL builds a shared representation by searching for a low-dimensional feature subspace that allows us to accurately predict the presence or absence of pivot features on unlabeled data. We demonstrate SCL on two text processing problems: sentiment classification of product reviews and part-of-speech tagging. For both tasks, SCL significantly improves over state-of-the-art supervised models using only unlabeled target data. In the second part of the thesis, we develop a formal framework for analyzing domain adaptation tasks. We first describe a measure of divergence, the H∆H-divergence, that depends on the hypothesis class H from which we estimate our supervised model. We then use this measure to state an upper bound on the true target error of a model trained to minimize a convex combination of empirical source and target errors. The bound characterizes the tradeoff inherent in training on both the large quantity of biased source data and the small quantity of unbiased target data, and it can be computed from finite labeled and unlabeled samples of the source and target distributions under relatively weak assumptions. Finally, we confirm experimentally that the bound corresponds well to empirical target error for the task of sentiment classification.
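The SCL recipe summarized in the abstract — predict each pivot feature from the remaining features on unlabeled data, then take a low-dimensional subspace of the stacked predictor weights as a shared representation — can be sketched as follows. This is a minimal illustrative sketch, not the thesis implementation: the function names and dimensions are hypothetical, and ridge-regularized least squares stands in for whichever loss the original method uses to fit the pivot predictors.

```python
import numpy as np

def scl_projection(X_unlabeled, pivot_idx, k=2):
    """Learn a shared low-dimensional representation (SCL sketch).

    For each pivot feature, fit a linear predictor of its presence
    from the non-pivot features; the top-k left singular vectors of
    the stacked predictor weights span the shared subspace.
    """
    n, d = X_unlabeled.shape
    pivots = set(pivot_idx)
    non_pivot = [j for j in range(d) if j not in pivots]
    A = X_unlabeled[:, non_pivot]
    W = []
    for p in pivot_idx:
        # +/-1 target: does the pivot feature occur in this example?
        y = (X_unlabeled[:, p] > 0).astype(float) * 2 - 1
        # ridge-regularized least squares (a stand-in choice of loss)
        w = np.linalg.solve(A.T @ A + 1e-2 * np.eye(len(non_pivot)), A.T @ y)
        W.append(w)
    W = np.stack(W)                       # (num_pivots, d - num_pivots)
    U, _, _ = np.linalg.svd(W.T, full_matrices=False)
    theta = U[:, :k]                      # basis of the shared subspace
    return non_pivot, theta

def project(X, non_pivot, theta):
    # Augment the original features with the learned shared features,
    # so a downstream linear model can use both.
    return np.hstack([X, X[:, non_pivot] @ theta])
```

A downstream classifier would be trained on `project(X_source, ...)` with source labels and applied to `project(X_target, ...)`; because the subspace is estimated from unlabeled data in both domains, the shared features transfer across the domain gap.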