Learning Algorithms for Domain Adaptation

  • Authors:
  • Manas A. Pathak;Eric H. Nyberg

  • Affiliations:
  • Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA 15213;Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA 15213

  • Venue:
  • ACML '09 Proceedings of the 1st Asian Conference on Machine Learning: Advances in Machine Learning
  • Year:
  • 2009

Quantified Score

Hi-index 0.00

Visualization

Abstract

A fundamental assumption for any machine learning task is to have training and test data instances drawn from the same distribution while having a sufficiently large number of training instances. In many practical settings, this ideal assumption is invalidated as the labeled training instances are scarce and there is a high cost associated with labeling them. On the other hand, we might have access to plenty of labeled data from a different domain, which can provide useful information for the present domain. In this paper, we discuss adaptive learning techniques to address this specific problem: learning with little training data from the same distribution along with a large pool of data from a different distribution. An underlying theme of our work is to identify situations when the auxiliary data is likely to help in training with the primary data. We propose two algorithms for the domain adaptation task: dataset reweighting and subset selection. We present theoretical analysis of behavior of the algorithms based on the concept of domain similarity, which we use to formulate error bounds for our algorithms. We also present an experimental evaluation of our techniques on data from a real world question answering system.