Learning Algorithms for Domain Adaptation

Authors:
Manas A. Pathak;Eric H. Nyberg
Affiliations:
Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA 15213;Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA 15213
Venue:
ACML '09 Proceedings of the 1st Asian Conference on Machine Learning: Advances in Machine Learning
Year:
2009

Citing 5
Cited 0

Building a large annotated corpus of English: the penn treebank

Computational Linguistics - Special issue on using large corpora: II
Improving SVM accuracy by training on auxiliary data sources

ICML '04 Proceedings of the twenty-first international conference on Machine learning
Logistic regression with an auxiliary data source

ICML '05 Proceedings of the 22nd international conference on Machine learning
Discriminative learning for differing training and test distributions

Proceedings of the 24th international conference on Machine learning
Detecting change in data streams

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30

Quantified Score

Hi-index	0.00

Visualization

Abstract

A fundamental assumption for any machine learning task is to have training and test data instances drawn from the same distribution while having a sufficiently large number of training instances. In many practical settings, this ideal assumption is invalidated as the labeled training instances are scarce and there is a high cost associated with labeling them. On the other hand, we might have access to plenty of labeled data from a different domain, which can provide useful information for the present domain. In this paper, we discuss adaptive learning techniques to address this specific problem: learning with little training data from the same distribution along with a large pool of data from a different distribution. An underlying theme of our work is to identify situations when the auxiliary data is likely to help in training with the primary data. We propose two algorithms for the domain adaptation task: dataset reweighting and subset selection. We present theoretical analysis of behavior of the algorithms based on the concept of domain similarity, which we use to formulate error bounds for our algorithms. We also present an experimental evaluation of our techniques on data from a real world question answering system.