Making holistic schema matching robust: an ensemble approach

Authors:
Bin He;Kevin Chen-Chuan Chang
Affiliations:
University of Illinois at Urbana-Champaign, Urbana, IL;University of Illinois at Urbana-Champaign, Urbana, IL
Venue:
Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Year:
2005

Abstract

The Web has been rapidly "deepened" by myriad searchable online databases whose data are hidden behind query interfaces. As an essential task toward integrating these massive "deep Web" sources, large-scale schema matching (i.e., discovering semantic correspondences of attributes across many query interfaces) has recently been actively studied. In particular, many works address this problem by "holistically" matching many schemas at the same time, and are therefore "mining" approaches in nature. However, while holistic schema matching builds its promise on the large quantity of input schemas, it also suffers from a robustness problem caused by noisy input data. Such noise almost inevitably arises in the automatic extraction of schema data, which is mandatory in large-scale integration. For holistic matching to be viable, it is thus essential to make it robust against noisy schemas. To tackle this challenge, we propose a data-ensemble framework with sampling and voting techniques, inspired by bagging predictors. Specifically, our approach creates an ensemble of matchers by randomizing the input schema data into many independently downsampled trials, executing the same matcher on each trial, and then aggregating the ranked results by majority voting. As a principled basis, we provide an analytic justification of the effectiveness of this data-ensemble framework. Further, our experiments on real Web data show that this "ensemblization" significantly boosts matching accuracy under noisy schema input, and thus maintains the desired robustness of a holistic matcher.
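The sample-and-vote procedure described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: toy_matcher is a hypothetical stand-in for a holistic matcher (it ranks attribute pairs that are frequent but never co-occur in one interface), ensemble_match and its parameters (trials, sample_size, seed) are invented names, and voting is simplified to "the pair appears in the top-k output of more than half of the trials" rather than a full aggregation of ranked lists.

```python
import random
from collections import Counter
from itertools import combinations


def toy_matcher(schemas, top_k=1):
    """Hypothetical holistic matcher (not the paper's): rank attribute pairs
    that never co-occur in any schema, highest shared frequency first."""
    freq = Counter(a for s in schemas for a in set(s))
    together = Counter(
        pair for s in schemas for pair in combinations(sorted(set(s)), 2)
    )
    scores = {
        (a, b): min(freq[a], freq[b])
        for a, b in combinations(sorted(freq), 2)
        if together[(a, b)] == 0  # a single co-occurrence vetoes the pair
    }
    ranked = sorted(scores, key=lambda pair: (-scores[pair], pair))
    return ranked[:top_k]


def ensemble_match(schemas, matcher, trials=101, sample_size=None, seed=0):
    """Run the same matcher on many downsampled trials and keep the
    matchings returned by a majority of the trials (simplified voting)."""
    rng = random.Random(seed)
    sample_size = sample_size or max(1, len(schemas) // 2)
    votes = Counter()
    for _ in range(trials):
        trial = rng.sample(schemas, sample_size)  # sample without replacement
        votes.update(matcher(trial))
    return sorted(pair for pair, n in votes.items() if n > trials / 2)


# Toy input: 40 clean interfaces plus one mis-extracted (noisy) interface.
clean = [["author", "title"]] * 20 + [["writer", "title"]] * 20
noisy = clean + [["author", "writer", "title"]]

print(toy_matcher(noisy))  # one run over all of the input
print(ensemble_match(noisy, toy_matcher, trials=101, sample_size=5))
```

In this toy setting a single noisy interface is enough to make the one-shot matcher miss the author/writer correspondence, while most small samples exclude that interface, so the majority vote is likely to recover it. The sample size and number of trials here are arbitrary choices for the example, not values from the paper.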