A risk minimization framework for domain adaptation

  • Authors:
  • Bo Long; Sudarshan Lamkhede; Srinivas Vadrevu; Ya Zhang; Belle Tseng

  • Affiliations:
  • Yahoo! Labs, Sunnyvale, CA, USA (all authors)

  • Venue:
  • Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09)
  • Year:
  • 2009

Abstract

Supervised learning algorithms usually require a large volume of high-quality labeled training data, and obtaining such labeled examples in every domain of an application is often expensive. Domain adaptation aims to help in such cases by utilizing data available in related domains. However, transferring knowledge from one domain to another is often nontrivial because the data distributions differ across domains, and these distribution differences are usually very hard to measure and formulate. Hence, we introduce a new concept, the label-relation function, to transfer knowledge among domains without explicitly formulating the data distribution differences. Based on this concept, we propose a novel learning framework, Domain Transfer Risk Minimization (DTRM), which simultaneously minimizes the empirical risk for the target domain and the regularized empirical risk for the source domain. Under this framework, we further derive a generic algorithm, Domain Adaptation by Label Relation (DALR), that applies to both classification and regression settings. DALR iteratively updates the target hypothesis function and the outputs for the source domain until convergence. We provide an in-depth theoretical analysis of DTRM and establish fundamental error bounds. We also experimentally evaluate DALR on the task of ranking search results using real-world data. Our experimental results show that the proposed algorithm effectively and robustly utilizes data from source domains under various conditions: different sizes of source-domain data, different noise levels in the source-domain data, and different difficulty levels for the target-domain data.
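
The abstract only names the ingredients of DTRM and the alternating structure of DALR, so the following is a minimal sketch of what those iterative updates might look like, assuming a squared-loss linear hypothesis, ridge regularization, and a quadratic label-relation penalty that pulls the source outputs toward the observed source labels. The function name dalr_sketch and the parameters lam and gamma are illustrative assumptions, not identifiers from the paper.

```python
import numpy as np

def dalr_sketch(Xt, yt, Xs, ys, lam=1.0, gamma=1.0, n_iter=50, tol=1e-6):
    """Hypothetical DALR-style alternating minimization (squared loss).

    Alternates between:
      1. fitting a linear hypothesis w on the target examples plus the
         source examples with their current outputs zs (empirical risk
         minimization over both domains), and
      2. updating zs as a compromise between the model's predictions on
         the source inputs and the observed source labels ys (the
         regularized source-domain risk, with gamma weighting the
         assumed quadratic label-relation penalty).
    """
    zs = ys.astype(float).copy()        # initialize source outputs with source labels
    w = np.zeros(Xt.shape[1])
    for _ in range(n_iter):
        # Step 1: ridge regression on target data and source data labeled zs
        X = np.vstack([Xt, Xs])
        y = np.concatenate([yt, zs])
        A = X.T @ X + lam * np.eye(X.shape[1])
        w_new = np.linalg.solve(A, X.T @ y)
        # Step 2: closed-form update of source outputs; minimizing
        # (h(x) - z)^2 + gamma * (z - y)^2 in z gives the blend below
        zs = (Xs @ w_new + gamma * ys) / (1.0 + gamma)
        if np.linalg.norm(w_new - w) < tol:
            w = w_new
            break
        w = w_new
    return w, zs

# Usage on synthetic data: source labels carry a constant shift,
# mimicking a distribution difference the iteration must absorb.
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
Xt = rng.normal(size=(20, 5))
yt = Xt @ w_true + 0.1 * rng.normal(size=20)
Xs = rng.normal(size=(200, 5))
ys = Xs @ w_true + 0.5 + 0.1 * rng.normal(size=200)
w, zs = dalr_sketch(Xt, yt, Xs, ys)
```

In this reading, gamma controls how much the source outputs are trusted: a large gamma keeps zs close to the observed source labels, while a small gamma lets the target-fitted hypothesis rewrite them, which is one plausible way an algorithm could use source data without ever modeling the distribution gap explicitly.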