Domain adaptation of natural language processing systems

  • Authors:
  • Fernando Pereira; John Blitzer

  • Affiliations:
  • University of Pennsylvania; University of Pennsylvania

  • Venue:
  • Ph.D. thesis, University of Pennsylvania
  • Year:
  • 2008


Abstract

Statistical language processing models are being applied to an ever wider and more varied range of linguistic domains. Collecting and curating training sets for each different domain is prohibitively expensive, and at the same time differences in vocabulary and writing style across domains can cause the error of state-of-the-art supervised models to increase dramatically. The first part of this thesis describes structural correspondence learning (SCL), a method for adapting linear discriminative models from resource-rich source domains to resource-poor target domains. The key idea is the use of pivot features, which occur frequently and behave similarly in both the source and target domains. SCL builds a shared representation by searching for a low-dimensional feature subspace that allows us to accurately predict the presence or absence of pivot features on unlabeled data. We demonstrate SCL on two text processing problems: sentiment classification of product reviews and part-of-speech tagging. For both tasks, SCL significantly improves over state-of-the-art supervised models using only unlabeled target data. In the second part of the thesis, we develop a formal framework for analyzing domain adaptation tasks. We first describe a measure of divergence, the H∆H-divergence, that depends on the hypothesis class H from which we estimate our supervised model. We then use this measure to state an upper bound on the true target error of a model trained to minimize a convex combination of empirical source and target errors. The bound characterizes the tradeoff inherent in training on both the large quantity of biased source data and the small quantity of unbiased target data, and it can be computed from finite labeled and unlabeled samples of the source and target distributions under relatively weak assumptions. Finally, we confirm experimentally that the bound corresponds well to empirical target error for the task of sentiment classification.
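The SCL recipe summarized in the abstract — predict each pivot feature from the remaining features on unlabeled data, then take a low-dimensional subspace of the stacked predictor weights as a shared representation — can be sketched as follows. This is a minimal illustrative sketch, not the thesis implementation: the function names and dimensions are hypothetical, and ridge-regularized least squares stands in for whichever loss the original method uses to fit the pivot predictors.

```python
import numpy as np

def scl_projection(X_unlabeled, pivot_idx, k=2):
    """Learn a shared low-dimensional representation (SCL sketch).

    For each pivot feature, fit a linear predictor of its presence
    from the non-pivot features; the top-k left singular vectors of
    the stacked predictor weights span the shared subspace.
    """
    n, d = X_unlabeled.shape
    pivots = set(pivot_idx)
    non_pivot = [j for j in range(d) if j not in pivots]
    A = X_unlabeled[:, non_pivot]
    W = []
    for p in pivot_idx:
        # +/-1 target: does the pivot feature occur in this example?
        y = (X_unlabeled[:, p] > 0).astype(float) * 2 - 1
        # ridge-regularized least squares (a stand-in choice of loss)
        w = np.linalg.solve(A.T @ A + 1e-2 * np.eye(len(non_pivot)), A.T @ y)
        W.append(w)
    W = np.stack(W)                       # (num_pivots, d - num_pivots)
    U, _, _ = np.linalg.svd(W.T, full_matrices=False)
    theta = U[:, :k]                      # basis of the shared subspace
    return non_pivot, theta

def project(X, non_pivot, theta):
    # Augment the original features with the learned shared features,
    # so a downstream linear model can use both.
    return np.hstack([X, X[:, non_pivot] @ theta])
```

A downstream classifier would be trained on `project(X_source, ...)` with source labels and applied to `project(X_target, ...)`; because the subspace is estimated from unlabeled data in both domains, the shared features transfer across the domain gap.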