Estimating the confidence of conditional functional dependencies

Authors:
Graham Cormode;Lukasz Golab;Flip Korn;Andrew McGregor;Divesh Srivastava;Xi Zhang
Affiliations:
AT&T Labs, Florham Park, NJ, USA;AT&T Labs, Florham Park, NJ, USA;AT&T Labs, Florham Park, NJ, USA;University of Massachusetts Amherst, Amherst, MA, USA;AT&T Labs, Florham Park, NJ, USA;SUNY Buffalo, Buffalo, NY, USA
Venue:
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Year:
2009

Abstract

Conditional functional dependencies (CFDs) have recently been proposed as extensions of classical functional dependencies that apply to a certain subset of the relation, as specified by a pattern tableau. Calculating the support and confidence of a CFD (i.e., the size of the applicable subset and the extent to which it satisfies the CFD) gives valuable information about data semantics and data quality. While computing the support is easier, computing the confidence exactly is expensive if the relation is large, and estimating it from a random sample of the relation is unreliable unless the sample is large. We study how to efficiently estimate the confidence of a CFD with a small number of passes (one or two) over the input using small space. Our solutions are based on a variety of sampling and sketching techniques. They apply when the pattern tableau is known in advance, and also in the harder case when it is given only after the data have been seen. We analyze our algorithms and show that they can guarantee a small additive error; we also show that relative error guarantees are not possible. We demonstrate the power of these methods empirically, with a detailed study using both real and synthetic data. These experiments show that it is possible to estimate the CFD confidence very accurately with summaries that are much smaller than the data they represent.
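
To make the two quantities concrete, the following is a minimal sketch of an exact (not space-efficient) computation of support and confidence for a single pattern. It is an illustration, not the paper's algorithm. It assumes that confidence means the largest fraction of matching tuples that can be kept so that no two kept tuples agree on the left-hand side but differ on the right-hand side, which is a common definition that the abstract does not spell out. The function names, the `"_"` wildcard, the restriction of patterns to left-hand-side attributes, and the sample rows are all hypothetical.

```python
from collections import Counter, defaultdict

WILDCARD = "_"  # assumed marker for "any value" in a pattern


def matches(row, pattern):
    """True if the row agrees with every constant in the pattern."""
    return all(v == WILDCARD or row[a] == v for a, v in pattern.items())


def support_and_confidence(rows, lhs, rhs, pattern):
    """Exact support and confidence of the CFD lhs -> rhs under one pattern.

    Support is the number of rows matching the pattern. Confidence is
    assumed to be the fraction of those rows that survive when, for each
    distinct lhs value, only the most frequent rhs value is kept.
    """
    groups = defaultdict(Counter)  # lhs value -> counts of rhs values
    support = 0
    for row in rows:
        if not matches(row, pattern):
            continue
        support += 1
        key = tuple(row[a] for a in lhs)
        groups[key][row[rhs]] += 1
    if support == 0:
        return 0, 1.0  # convention chosen here: vacuously satisfied
    kept = sum(max(counts.values()) for counts in groups.values())
    return support, kept / support


# Hypothetical toy relation.
rows = [
    {"country": "UK", "zip": "EH4", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH4", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH4", "city": "Glasgow"},
    {"country": "UK", "zip": "W1", "city": "London"},
    {"country": "US", "zip": "07932", "city": "Florham Park"},
]

# CFD: (country, zip) -> city, restricted to country = "UK".
print(support_and_confidence(rows, ("country", "zip"), "city", {"country": "UK"}))
# prints (4, 0.75)
```

This exact version stores one counter per distinct left-hand-side value, so its memory can grow with the size of the relation. That cost is what motivates estimating the confidence in small space instead.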