Validating Multi-column Schema Matchings by Type

Authors:
Bing Tian Dai;Nick Koudas;Divesh Srivastava;Anthony K. H. Tung;Suresh Venkatasubramanian
Affiliations:
National University of Singapore, Singapore 117590, Republic of Singapore. daibingt@comp.nus.edu.sg;University of Toronto, Toronto, ON M5S 2E4, Canada. koudas@cs.toronto.edu;AT&T Labs-Research, Florham Park, NJ 07932, USA. divesh@research.att.com;National University of Singapore, Singapore 117590, Republic of Singapore. atung@comp.nus.edu.sg;University of Utah, Salt Lake City, UT 84112, USA. suresh@cs.utah.edu
Venue:
ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Year:
2008

Citing 0
Cited 11

Content-based ontology matching for GIS datasets

Proceedings of the 16th ACM SIGSPATIAL international conference on Advances in geographic information systems
Type-based categorization of relational attributes

Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
Global Interoperability Using Semantics, Standards, Science and Technology (GIS3T)

Computer Standards & Interfaces
Design of a temporal geosocial semantic web for military stabilization and reconstruction operations

Proceedings of the ACM SIGKDD Workshop on CyberSecurity and Intelligence Informatics
Geographically-typed semantic schema matching

Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems
Information theory for data management

Proceedings of the VLDB Endowment
Sampling dirty data for matching attributes

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Information theory for data management

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Enhanced geographically typed semantic schema matching

Web Semantics: Science, Services and Agents on the World Wide Web
Schema-as-you-go: on probabilistic tagging and querying of wide tables

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Instance-Based matching of large ontologies using locality-sensitive hashing

ISWC'12 Proceedings of the 11th international conference on The Semantic Web - Volume Part I

Abstract

Validation of multi-column schema matchings is essential for successful database integration. This task is especially difficult when the databases to be integrated contain little overlapping data, as is often the case in practice (e.g., customer bases of different companies). Based on the intuition that values present in different columns related by a schema matching will have a similar "semantic type", and that this can be captured using distributions over values ("statistical types"), we develop a method for validating 1:1 and compositional schema matchings. Our technique is based on three key technical ideas. First, we propose a generic measure for comparing two columns matched by a schema matching, based on a notion of information-theoretic discrepancy that generalizes the standard geometric discrepancy; this provides the basis for 1:1 matching. Second, we present an algorithm for "splitting" the string values in a column to identify substrings that are likely to match with the values in another column; this enables (multi-column) 1:m schema matching. Third, our technique provides an invalidation certificate if it fails to validate a schema matching. We complement our conceptual and algorithmic contributions with an experimental study that demonstrates the effectiveness and efficiency of our technique on a variety of database schemas and data sets.
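
The idea of comparing columns by "statistical type" rather than by shared values can be illustrated with a small sketch. The abstract does not define the paper's discrepancy measure, so the code below substitutes a different, well-known quantity (Jensen-Shannon divergence) computed over coarse character-class patterns. The function names `shape`, `distribution`, and `js_divergence`, and the sample columns, are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def shape(value):
    """Map a string to a coarse pattern: digits -> '9', letters -> 'a', others kept."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("a")
        else:
            out.append(ch)
    return "".join(out)


def distribution(column):
    """Empirical distribution over the shapes of a column's values."""
    counts = Counter(shape(v) for v in column)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (a stand-in, not the paper's measure).

    Ranges from 0 (identical distributions) to 1 (disjoint supports).
    """
    div = 0.0
    for k in set(p) | set(q):
        pk, qk = p.get(k, 0.0), q.get(k, 0.0)
        mk = 0.5 * (pk + qk)
        if pk > 0:
            div += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            div += 0.5 * qk * math.log2(qk / mk)
    return div


# Hypothetical columns from two databases with no overlapping values.
phones_a = ["555-1234", "555-9876", "555-0000"]
phones_b = ["412-5550", "617-2231"]
names = ["Alice", "Bob", "Carol"]

same_type = js_divergence(distribution(phones_a), distribution(phones_b))
diff_type = js_divergence(distribution(phones_a), distribution(names))
print(same_type, diff_type)
```

In this toy setting the two phone columns share no values yet have the same pattern distribution, so their divergence is zero, while the phone and name columns have disjoint patterns and reach the maximum. A real validator would need a richer notion of type than character classes; this sketch only shows why distribution-level comparison can work when value overlap is small.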