Heuristic strategies for the discovery of inclusion dependencies and other patterns

Authors:
Andreas Koeller; Elke A. Rundensteiner
Affiliations:
Oracle Corporation, NEDC, Nashua, NH; Department of Computer Science, Worcester Polytechnic Institute, Worcester, MA
Venue:
Journal on Data Semantics V
Year:
2006

Abstract

Inclusion dependencies (INDs) between databases are assertions of subset relationships between sets of attributes (dimensions) in two relations. Such dependencies are useful for a number of purposes related to information integration, such as database similarity discovery and foreign key discovery. An exhaustive approach to discovering INDs between two relations suffers from the curse of dimensionality, since the number of potential mappings between the attributes of two relations is exponential in the number of attributes. For this reason, levelwise (Apriori-like) approaches to discovery do not scale beyond relations with 8 to 10 attributes. Approaches that model the similarity space as graphs or hypergraphs are promising, but they also do not scale well. This paper discusses ways to scale discovery algorithms for INDs and some other similarity patterns in databases. The major obstacle to scalability is the exponentially growing size of the data structure representing potential INDs. The focus of our solution is therefore on heuristic techniques that reduce the number of IND candidates considered by the algorithm. Despite the use of heuristics, the accuracy of the results is good for real-world data. Experiments are presented that assess the quality of the discovery results against the runtime savings. We conclude that the heuristic approach is useful and improves scalability significantly. It is particularly applicable to relations that have attributes with few distinct values.
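
To make the notion of an inclusion dependency concrete, the following is a minimal sketch of a validity check for a single candidate, plus a brute-force enumeration of unary INDs. It is not the paper's heuristic algorithm; the function names (project, holds_ind, unary_inds) and the toy relations are assumptions introduced only for illustration, and relations are assumed to be small in-memory lists of dictionaries.

```python
from itertools import product


def project(rows, attrs):
    """Return the set of distinct value tuples of `rows` restricted to `attrs`."""
    return {tuple(row[a] for a in attrs) for row in rows}


def holds_ind(r, x, s, y):
    """Check whether the IND R[X] subset-of S[Y] holds on the given data.

    `x` and `y` are equal-length attribute lists; position i of `x`
    is mapped to position i of `y`.
    """
    if len(x) != len(y):
        raise ValueError("attribute lists must have the same arity")
    return project(r, x) <= project(s, y)


def unary_inds(r, r_attrs, s, s_attrs):
    """Enumerate all valid unary INDs by testing every attribute pair."""
    return [
        (a, b)
        for a, b in product(r_attrs, s_attrs)
        if holds_ind(r, [a], s, [b])
    ]


# Hypothetical toy relations (illustrative data only).
customers = [
    {"id": 1, "town": "Nashua"},
    {"id": 2, "town": "Worcester"},
    {"id": 3, "town": "Boston"},
]
orders = [
    {"cust": 1, "city": "Nashua"},
    {"cust": 2, "city": "Worcester"},
]
# Same single-column values as `orders`, but paired differently.
orders_mixed = [
    {"cust": 1, "city": "Nashua"},
    {"cust": 2, "city": "Nashua"},
]

unary = unary_inds(orders, ["cust", "city"], customers, ["id", "town"])
binary_ok = holds_ind(orders, ["cust", "city"], customers, ["id", "town"])
binary_mixed = holds_ind(orders_mixed, ["cust", "city"], customers, ["id", "town"])
```

In this toy data, both unary INDs (cust in id, city in town) also hold for orders_mixed, yet the binary IND fails because the pair (2, "Nashua") does not occur in customers. Valid lower-arity INDs are therefore necessary but not sufficient for a higher-arity one, which is why a levelwise search must generate and test candidates at every arity, and why the candidate space grows so quickly with the number of attributes.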