Analyzing and revising data integration schemas to improve their matchability

  • Authors: Xiaoyong Chai; Mayssam Sayyadian; AnHai Doan; Arnon Rosenthal; Len Seligman
  • Affiliations: University of Wisconsin-Madison; University of Wisconsin-Madison; University of Wisconsin-Madison; The MITRE Corporation; The MITRE Corporation
  • Venue: Proceedings of the VLDB Endowment
  • Year: 2008

Abstract

Data integration systems often provide a uniform query interface, called a mediated schema, to a multitude of data sources. To answer user queries, such systems employ a set of semantic matches between the mediated schema and the data-source schemas. Finding such matches is well known to be difficult. Hence much work has focused on developing semi-automatic techniques to efficiently find the matches. In this paper we consider the complementary problem of improving the mediated schema, to make finding such matches easier. Specifically, a mediated schema S will typically be matched with many source schemas. Thus, can the developer of S analyze and revise S in a way that preserves S's semantics, and yet makes it easier to match with in the future? In this paper we provide an affirmative answer to the above question, and outline a promising solution direction, called mSeer. Given a mediated schema S and a matching tool M, mSeer first computes a matchability score that quantifies how well S can be matched against using M. Next, mSeer uses this score to generate a matchability report that identifies the problems in matching S. Finally, mSeer addresses these problems by automatically suggesting changes to S (e.g., renaming an attribute, reformatting data values, etc.) that it believes will preserve the semantics of S and yet make it more amenable to matching. We present extensive experiments over several real-world domains that demonstrate the promise of the proposed approach.
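
The three steps described in the abstract (score, report, suggested revision) can be illustrated with a toy sketch. This is not the authors' implementation of mSeer: the name-similarity matcher standing in for M, the accuracy-based score, the 0.5 threshold, the function names, and every schema and attribute name below are assumptions chosen only for illustration.

```python
from difflib import SequenceMatcher

def name_matcher(name, source_attrs):
    """Hypothetical matching tool M: pick the most similar source attribute name."""
    return max(source_attrs,
               key=lambda s: SequenceMatcher(None, name.lower(), s.lower()).ratio())

def matchability(mediated, sources, matcher=name_matcher):
    """Score a mediated schema against sources whose correct matches are known.

    mediated: dict mapping a concept id to its attribute name in schema S.
    sources:  list of (attribute_names, gold), where gold maps concept id
              to the correct source attribute.
    Returns (overall score, per-concept accuracy), both in [0, 1].
    """
    per_concept = {}
    for concept, name in mediated.items():
        hits = sum(matcher(name, attrs) == gold[concept] for attrs, gold in sources)
        per_concept[concept] = hits / len(sources)
    return sum(per_concept.values()) / len(per_concept), per_concept

def report(per_concept, threshold=0.5):
    """List the concepts that the matcher gets wrong too often (assumed threshold)."""
    return sorted(c for c, acc in per_concept.items() if acc < threshold)

def suggest_rename(mediated, sources, concept, candidates):
    """Try candidate names for one concept and return the best-scoring one.

    Only the name changes; the concept id, and so the meaning, stays fixed.
    """
    return max(candidates,
               key=lambda c: matchability({**mediated, concept: c}, sources)[0])

# Assumed example data: one mediated schema and two source schemas.
mediated = {"c1": "price", "c2": "loc"}
sources = [
    (["list_price", "address", "local_tax"],
     {"c1": "list_price", "c2": "address"}),
    (["price", "street_address", "lot_code"],
     {"c1": "price", "c2": "street_address"}),
]

score, per_concept = matchability(mediated, sources)
problems = report(per_concept)
best_name = suggest_rename(mediated, sources, "c2", ["location", "address"])
revised_score, _ = matchability({**mediated, "c2": best_name}, sources)
print(score, problems, best_name, revised_score)
```

In this sketch the revision is a rename that leaves the concept id untouched, which mirrors the abstract's requirement that a suggested change preserve the semantics of S while making it easier to match.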