Inclusion dependencies (INDs) between databases are assertions of subset relationships between sets of attributes (dimensions) in two relations. Such dependencies are useful for several purposes related to information integration, such as database similarity discovery and foreign key discovery. An exhaustive approach to discovering INDs between two relations suffers from the curse of dimensionality, since the number of potential mappings between the attributes of two relations is exponential in the number of attributes. For this reason, levelwise (Apriori-like) approaches to discovery do not scale beyond relations with 8 to 10 attributes. Approaches that model the similarity space as graphs or hypergraphs are promising but also do not scale well. This paper discusses ways to scale discovery algorithms for INDs and some other similarity patterns in databases. The major obstacle to scalability is the exponentially growing size of the data structure representing potential INDs. The focus of our solution is therefore on heuristic techniques that reduce the number of IND candidates considered by the algorithm. Despite the use of heuristics, the accuracy of the results is good for real-world data. Experiments are presented that assess the quality of the discovery results against the runtime savings. We conclude that the heuristic approach is useful and improves scalability significantly. It is particularly applicable to relations that have attributes with few distinct values.
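
To make the definition and the candidate explosion concrete, the following is a minimal sketch and not the paper's algorithm. It assumes relations are held in memory as lists of dictionaries, and the names `holds`, `unary_inds`, and `candidate_count` are hypothetical helpers introduced only for illustration. It checks a single IND by set containment, enumerates all single-attribute INDs exhaustively (roughly the first level of a levelwise search), and counts how many k-ary candidates exist between two relations.

```python
from math import comb, perm


def holds(r, r_attrs, s, s_attrs):
    """Return True if the IND R[r_attrs] is a subset of S[s_attrs].

    Relations are lists of dicts (one dict per tuple). The two attribute
    lists must have equal length and are matched position by position.
    """
    left = {tuple(row[a] for a in r_attrs) for row in r}
    right = {tuple(row[a] for a in s_attrs) for row in s}
    return left <= right


def unary_inds(r, s):
    """Exhaustively test every single-attribute pair between R and S."""
    r_cols = list(r[0]) if r else []
    s_cols = list(s[0]) if s else []
    return [(a, b) for a in r_cols for b in s_cols if holds(r, [a], s, [b])]


def candidate_count(n, m, k):
    """Number of k-ary IND candidates between relations with n and m attributes.

    Choose k attributes of R (order fixed), then an ordered selection of
    k distinct attributes of S to pair them with.
    """
    return comb(n, k) * perm(m, k)


# Illustrative data (an assumption, not taken from the paper).
orders = [{"cust": 1, "item": "a"}, {"cust": 2, "item": "b"}]
customers = [
    {"id": 1, "name": "x"},
    {"id": 2, "name": "y"},
    {"id": 3, "name": "z"},
]

print(unary_inds(orders, customers))
print([candidate_count(10, 10, k) for k in range(1, 6)])
```

On this toy data, `orders[cust]` is contained in `customers[id]`, which is the kind of pattern that suggests a foreign key. The second printout shows how quickly the candidate space grows with arity for two 10-attribute relations, which is why pruning candidates matters more than speeding up any individual check.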