Building a large annotated corpus of English: the penn treebank
Computational Linguistics - Special issue on using large corpora: II
Improving SVM accuracy by training on auxiliary data sources
ICML '04 Proceedings of the twenty-first international conference on Machine learning
Logistic regression with an auxiliary data source
ICML '05 Proceedings of the 22nd international conference on Machine learning
Discriminative learning for differing training and test distributions
Proceedings of the 24th international conference on Machine learning
Detecting change in data streams
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Hi-index | 0.00 |
A fundamental assumption for any machine learning task is to have training and test data instances drawn from the same distribution while having a sufficiently large number of training instances. In many practical settings, this ideal assumption is invalidated as the labeled training instances are scarce and there is a high cost associated with labeling them. On the other hand, we might have access to plenty of labeled data from a different domain, which can provide useful information for the present domain. In this paper, we discuss adaptive learning techniques to address this specific problem: learning with little training data from the same distribution along with a large pool of data from a different distribution. An underlying theme of our work is to identify situations when the auxiliary data is likely to help in training with the primary data. We propose two algorithms for the domain adaptation task: dataset reweighting and subset selection. We present theoretical analysis of behavior of the algorithms based on the concept of domain similarity, which we use to formulate error bounds for our algorithms. We also present an experimental evaluation of our techniques on data from a real world question answering system.