CORDS: automatic discovery of correlations and soft functional dependencies

Authors:
Ihab F. Ilyas;Volker Markl;Peter Haas;Paul Brown;Ashraf Aboulnaga
Affiliations:
Purdue University, West Lafayette, Indiana;IBM Almaden Research Center, San Jose, CA;IBM Almaden Research Center, San Jose, CA;IBM Almaden Research Center, San Jose, CA;IBM Almaden Research Center, San Jose, CA
Venue:
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Year:
2004



Abstract

The rich dependency structure found in the columns of real-world relational databases can be exploited to great advantage, but can also cause query optimizers, which usually assume that columns are statistically independent, to underestimate the selectivities of conjunctive predicates by orders of magnitude. We introduce CORDS, an efficient and scalable tool for automatic discovery of correlations and soft functional dependencies between columns. CORDS searches for column pairs that might have interesting and useful dependency relations by systematically enumerating candidate pairs and simultaneously pruning unpromising candidates using a flexible set of heuristics. A robust chi-squared analysis is applied to a sample of column values in order to identify correlations, and the number of distinct values in the sampled columns is analyzed to detect soft functional dependencies. CORDS can be used as a data mining tool, producing dependency graphs that are of intrinsic interest. We focus primarily on the use of CORDS in query optimization. Specifically, CORDS recommends groups of columns on which to maintain certain simple joint statistics. These "column-group" statistics are then used by the optimizer to avoid naive selectivity estimates based on inappropriate independence assumptions. This approach, because of its simplicity and judicious use of sampling, is relatively easy to implement in existing commercial systems, has very low overhead, and scales well to the large numbers of columns and large table sizes found in real-world databases. Experiments with a prototype implementation show that the use of CORDS in query optimization can speed up query execution times by an order of magnitude. CORDS can be used in tandem with query feedback systems such as the LEO learning optimizer, leveraging the infrastructure of such systems to correct bad selectivity estimates and ameliorating the poor performance of feedback systems during slow learning phases.
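
The two tests named in the abstract, and the effect of a column-group statistic on a selectivity estimate, can be illustrated with a minimal sketch. This is not the CORDS implementation. It uses a plain Pearson chi-squared statistic in place of the paper's robust analysis, a simple distinct-value ratio as a stand-in for the soft functional dependency test, and uniformity-based estimates of the form 1/(distinct count). All of these are assumptions made for illustration. The table, the column names, and the function names `chi_squared`, `soft_fd_strength`, and `selectivity_estimates` are hypothetical.

```python
import random
from collections import Counter


def chi_squared(pairs):
    """Return (Pearson chi-squared statistic, degrees of freedom) for value pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    stat = 0.0
    for a, n_a in left.items():
        for b, n_b in right.items():
            expected = n_a * n_b / n  # cell count expected under independence
            stat += (joint[(a, b)] - expected) ** 2 / expected
    return stat, (len(left) - 1) * (len(right) - 1)


def soft_fd_strength(pairs):
    """Ratio |distinct a| / |distinct (a, b)|; a value near 1 suggests a => b."""
    return len({a for a, _ in pairs}) / len(set(pairs))


def selectivity_estimates(pairs):
    """Estimate the selectivity of `a = x AND b = y`, assuming uniform values.

    Returns (independence-based estimate, column-group estimate).
    """
    d_a = len({a for a, _ in pairs})
    d_b = len({b for _, b in pairs})
    d_ab = len(set(pairs))  # the joint "column-group" distinct count
    return 1 / (d_a * d_b), 1 / d_ab


# Hypothetical table: model determines make, while color is unrelated to model.
MAKE_OF = {"Accord": "Honda", "Civic": "Honda", "Camry": "Toyota",
           "Corolla": "Toyota", "X3": "BMW", "X5": "BMW"}
MODELS = list(MAKE_OF)
COLORS = ["red", "green", "blue"]
rows = [(MODELS[i % 6], MAKE_OF[MODELS[i % 6]], COLORS[(i // 6) % 3])
        for i in range(600)]

# Work on a sample rather than the full table; size and seed are arbitrary here.
sample = random.Random(0).sample(rows, 200)
model_make = [(model, make) for model, make, _ in sample]
model_color = [(model, color) for model, _, color in sample]

print("chi-squared, model vs make: ", chi_squared(model_make))
print("chi-squared, model vs color:", chi_squared(model_color))
print("soft FD strength, model => make:", soft_fd_strength(model_make))
print("selectivity (independence, column group):",
      selectivity_estimates(model_make))
```

In a real test the statistic would be compared against a chi-squared quantile for the given degrees of freedom, and the strength ratio against a chosen threshold. Both cutoffs are left to the caller in this sketch. On the full hypothetical table, the independence-based estimate for a predicate such as `model = 'Accord' AND make = 'Honda'` is 1/18, while the column-group estimate is 1/6, which matches the true fraction of matching rows.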