Efficiently Computing Inclusion Dependencies for Schema Discovery

Authors:
Jana Bauckmann;Ulf Leser;Felix Naumann
Affiliations:
Humboldt-Universität zu Berlin, Germany;Humboldt-Universität zu Berlin, Germany;Humboldt-Universität zu Berlin, Germany
Venue:
ICDEW '06 Proceedings of the 22nd International Conference on Data Engineering Workshops
Year:
2006

Citing 0
Cited 4

A constraint-based tool for data integrity management on the web

Proceedings of the 4th International Conference on Ubiquitous Information Management and Communication
Filtering and ranking schemes for finding inclusion dependencies on the web

Proceedings of the 21st international conference companion on World Wide Web
Meta-modeling of inclusion dependency constraints

Proceedings of the 6th Balkan Conference in Informatics
Efficient filtering and ranking schemes for finding inclusion dependencies on the web

Proceedings of the 22nd ACM International Conference on Information & Knowledge Management

Abstract

Large data integration projects must often cope with undocumented data sources. Schema discovery aims at automatically finding structures in such cases. An important class of relationships between attributes that can be detected automatically is inclusion dependencies (INDs), which provide an excellent basis for guessing foreign key constraints. INDs can be discovered by comparing the sets of distinct values of pairs of attributes. In this paper we present efficient algorithms for finding unary INDs. We first show that (and why) SQL is not suitable for this task. We then develop two algorithms that compute inclusion dependencies outside of the database. Both are much faster than the SQL-based methods; in fact, for larger schemas they are the only feasible solution. Our experiments show that we can compute all unary INDs in a schema of 1,680 attributes with a total database size of 3.2 GB in approximately 2.5 hours.
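
To make the idea of comparing distinct value sets concrete, the following is a minimal Python sketch of a naive, in-memory check for unary INDs. It is not one of the two algorithms developed in the paper, which the abstract does not describe; it only illustrates the definition. The function name `unary_inds`, the dictionary input format, and the sample attribute names are hypothetical, and the choices to ignore NULL (`None`) values and to skip empty attributes are assumptions made for the illustration.

```python
from itertools import permutations


def unary_inds(columns):
    """Return all pairs (dependent, referenced) where every distinct,
    non-NULL value of `dependent` also occurs in `referenced`.

    `columns` maps an attribute name to an iterable of its values.
    Assumption: NULLs (None) are ignored and empty attributes are skipped.
    """
    # Reduce each attribute to its set of distinct non-NULL values.
    distinct = {
        name: {v for v in values if v is not None}
        for name, values in columns.items()
    }
    result = []
    # Test every ordered pair of different attributes: quadratic in the
    # number of attributes, which is what makes large schemas expensive.
    for dep, ref in permutations(distinct, 2):
        if distinct[dep] and distinct[dep] <= distinct[ref]:
            result.append((dep, ref))
    return result


# Hypothetical sample data, not taken from the paper.
columns = {
    "orders.customer_id": [1, 2, 2, 3],
    "customers.id": [1, 2, 3, 4],
    "customers.name": ["Ann", "Bob", "Cy", "Dee"],
}
inds = unary_inds(columns)
print(inds)
```

On this sample the only IND found is `orders.customer_id` ⊆ `customers.id`, which is the kind of result that suggests a foreign key candidate. The sketch holds all distinct values in memory and tests every attribute pair, so it would not scale to the schema sizes reported in the abstract.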