GORDIAN: efficient and scalable discovery of composite keys

Authors:
Yannis Sismanis;Paul Brown;Peter J. Haas;Berthold Reinwald
Affiliations:
IBM Almaden Research Center, San Jose, CA;IBM Almaden Research Center, San Jose, CA;IBM Almaden Research Center, San Jose, CA;IBM Almaden Research Center, San Jose, CA
Venue:
VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Year:
2006

Abstract

Identification of (composite) key attributes is of fundamental importance for many data management tasks, such as data modeling, data integration, anomaly detection, query formulation, query optimization, and indexing. However, information about keys is often missing or incomplete in real-world database scenarios. Surprisingly, the fundamental problem of automatic key discovery has received little attention in the existing literature. Existing solutions ignore composite keys because of the complexity associated with their discovery. Even for simple keys, current algorithms take a brute-force approach; the resulting exponential CPU and memory requirements limit the applicability of these methods to small datasets. In this paper, we describe GORDIAN, a scalable algorithm for automatic discovery of keys in large datasets, including composite keys. GORDIAN can provide exact results very efficiently for both real-world and synthetic datasets. GORDIAN can be used to find (composite) key attributes in any collection of entities, e.g., key column-groups in relational data, or key leaf-node sets in a collection of XML documents with a common schema. We show empirically that GORDIAN can be combined with sampling to efficiently obtain high-quality sets of approximate keys even in very large datasets.
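
To make the problem concrete, the following is a minimal Python sketch of the brute-force baseline that the abstract contrasts GORDIAN with. It is not the GORDIAN algorithm. The function names (`is_key`, `minimal_keys_brute_force`) and the toy `employees` relation are hypothetical, introduced only for illustration. The sketch enumerates column subsets in order of increasing size and tests each one for uniqueness, which is why its cost grows exponentially with the number of columns.

```python
from itertools import combinations


def is_key(rows, cols):
    """Return True if no two rows agree on every column in `cols`."""
    seen = set()
    for row in rows:
        projection = tuple(row[c] for c in cols)
        if projection in seen:
            return False  # duplicate projection, so `cols` is not a key
        seen.add(projection)
    return True


def minimal_keys_brute_force(rows, num_cols):
    """Enumerate column subsets by increasing size and keep minimal keys.

    Illustrative baseline only: in the worst case it examines all
    2**num_cols - 1 non-empty column subsets.
    """
    keys = []
    for size in range(1, num_cols + 1):
        for cols in combinations(range(num_cols), size):
            # A superset of a known key is a key too, but not a minimal one.
            if any(set(k) <= set(cols) for k in keys):
                continue
            if is_key(rows, cols):
                keys.append(cols)
    return keys


# Hypothetical toy relation: (first_name, last_name, phone, emp_no)
employees = [
    ("Michael", "Thompson", 3478, 10),
    ("Sally", "Kwan", 3478, 20),
    ("Michael", "Spencer", 5237, 90),
    ("Michael", "Thompson", 6791, 50),
]

print(minimal_keys_brute_force(employees, 4))
# prints [(3,), (0, 2), (1, 2)]
```

In this toy relation, `emp_no` alone is a simple key, while `(first_name, phone)` and `(last_name, phone)` are composite keys that no single-column check would find.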