Schema matching on streams with accuracy guarantees

  • Authors:
  • Szymon Jaroszewicz;Lenka Ivantysynova;Tobias Scheffer

  • Affiliations:
  • National Institute of Telecommunications, Warsaw, Poland. E-mail: s.jaroszewicz@itl.waw.pl; Humboldt-Universität zu Berlin, Berlin, Germany. E-mail: lenka@wiwi.hu-berlin.de; Max Planck Institute for Computer Science, Saarbrücken, Germany. E-mail: scheffer@mpi-inf.mpg.de

  • Venue:
  • Intelligent Data Analysis - Knowledge Discovery from Data Streams
  • Year:
  • 2008

Abstract

We address the problem of matching imperfectly documented schemas of data streams and large databases. Instance-level schema matching algorithms identify likely correspondences between attributes by quantifying the similarity of their corresponding values. However, computing these similarities exactly requires processing all database records, which is infeasible for data streams. We devise a fast matching algorithm that uses only a small sample of records yet is guaranteed to find a matching that closely approximates the one that would be obtained if the entire stream were processed. The method can be applied to any given similarity metric, or combination of metrics, that can be estimated from a sample with bounded error; we apply the algorithm to several metrics. We give a rigorous proof of the method's correctness and report on experiments using large databases.
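
To make the phrase "estimated from a sample with bounded error" more concrete, here is a minimal sketch. It is not the authors' algorithm. It assumes, purely for illustration, that each attribute-pair similarity is the mean of a per-record score in [0, 1], so that Hoeffding's inequality with a union bound over all pairs applies. The function names (hoeffding_sample_size, hoeffding_radius, best_matches) are hypothetical and do not come from the paper.

```python
import math


def hoeffding_sample_size(eps, delta, num_pairs=1):
    """Sample size n such that, with probability >= 1 - delta, every one of
    num_pairs mean-type similarity estimates (scores in [0, 1]) lies within
    eps of its true value. Assumption: Hoeffding bound plus union bound."""
    if not (0 < eps < 1 and 0 < delta < 1 and num_pairs >= 1):
        raise ValueError("need 0 < eps < 1, 0 < delta < 1, num_pairs >= 1")
    return math.ceil(math.log(2.0 * num_pairs / delta) / (2.0 * eps ** 2))


def hoeffding_radius(n, delta, num_pairs=1):
    """Half-width of the simultaneous confidence interval after n records."""
    if n < 1:
        raise ValueError("n must be positive")
    return math.sqrt(math.log(2.0 * num_pairs / delta) / (2.0 * n))


def best_matches(estimates):
    """Map each source attribute to the target attribute with the highest
    estimated similarity. estimates: {(source, target): score}."""
    best = {}
    for (src, tgt), score in estimates.items():
        if src not in best or score > best[src][1]:
            best[src] = (tgt, score)
    return {src: tgt for src, (tgt, _) in best.items()}


if __name__ == "__main__":
    n = hoeffding_sample_size(eps=0.05, delta=0.05, num_pairs=100)
    print(n, hoeffding_radius(n, delta=0.05, num_pairs=100))
    print(best_matches({("a", "x"): 0.9, ("a", "y"): 0.4,
                        ("b", "x"): 0.2, ("b", "y"): 0.7}))
```

Under these assumptions, if every estimate is within eps of its true value, the target chosen for each source attribute has a true similarity within 2·eps of the best available one. The required sample size depends on eps, delta, and the number of attribute pairs, but not on the length of the stream.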