Making holistic schema matching robust: an ensemble approach

  • Authors:
  • Bin He; Kevin Chen-Chuan Chang

  • Affiliations:
  • University of Illinois at Urbana-Champaign, Urbana, IL; University of Illinois at Urbana-Champaign, Urbana, IL

  • Venue:
  • Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
  • Year:
  • 2005

Abstract

The Web has been rapidly "deepened" by myriad searchable online databases, where data are hidden behind query interfaces. As an essential task toward integrating these massive "deep Web" sources, large-scale schema matching (i.e., discovering semantic correspondences of attributes across many query interfaces) has been actively studied recently. In particular, many works address this problem by "holistically" matching many schemas at the same time, and are thus "mining" approaches in nature. However, while holistic schema matching builds its promise on the large quantity of input schemas, it also suffers from a robustness problem caused by noisy input data. Such noise almost inevitably arises in the automatic extraction of schema data, which is mandatory in large-scale integration. For holistic matching to be viable, it is thus essential to make it robust against noisy schemas. To tackle this challenge, we propose a data-ensemble framework with sampling and voting techniques, inspired by bagging predictors. Specifically, our approach creates an ensemble of matchers by randomizing the input schema data into many independently downsampled trials, executing the same matcher on each trial, and then aggregating the ranked results by majority voting. As a principled basis, we provide an analytic justification of the effectiveness of this data-ensemble framework. Further, our experiments on real Web data show empirically that this "ensemblization" significantly boosts matching accuracy under noisy schema input, and thus maintains the desired robustness of a holistic matcher.
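
The sample-and-vote procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `ensemble_match`, its parameters, and their defaults are hypothetical, and the aggregation is simplified to a majority vote over unordered sets of matchings, whereas the paper aggregates ranked results.

```python
import random
from collections import Counter
from typing import Callable, Hashable, List, Optional, Sequence

# Hypothetical types: a base matcher takes a list of schemas and returns
# the matchings (any hashable values) it discovers in them.
Matcher = Callable[[Sequence], List[Hashable]]


def ensemble_match(
    schemas: Sequence,
    base_matcher: Matcher,
    num_trials: int = 15,               # assumed default, not from the paper
    sample_size: Optional[int] = None,  # assumed default: half of the input
    seed: int = 0,
) -> List[Hashable]:
    """Run the same matcher on downsampled trials and keep majority matchings."""
    rng = random.Random(seed)
    pool = list(schemas)
    if sample_size is None:
        sample_size = max(1, len(pool) // 2)

    votes: Counter = Counter()
    for _ in range(num_trials):
        # Each trial is an independent random sample without replacement.
        trial = rng.sample(pool, sample_size)
        # A matching gets at most one vote per trial.
        votes.update(set(base_matcher(trial)))

    # Simplification: keep matchings found in a strict majority of trials,
    # instead of aggregating ranked lists as the paper does.
    return sorted(m for m, count in votes.items() if count > num_trials / 2)
```

Under these assumptions, a matching that the base matcher reports only when a particular noisy schema happens to be sampled tends to fall below the majority threshold, while matchings supported by most of the input survive the vote.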