BHUNT: automatic discovery of fuzzy algebraic constraints in relational data

Authors:
Paul G. Brown;Peter J. Haas
Affiliations:
IBM Almaden Research Center, San Jose, CA;IBM Almaden Research Center, San Jose, CA
Venue:
VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
Year:
2003


Abstract

We present the BHUNT scheme for automatically discovering algebraic constraints between pairs of columns in relational data. The constraints may be "fuzzy" in that they hold for most, but not all, of the records, and the columns may be in the same table or different tables. Such constraints are of interest in the context of both data mining and query optimization, and the BHUNT methodology can potentially be adapted to discover fuzzy functional dependencies and other useful relationships. BHUNT first identifies candidate sets of column value pairs that are likely to satisfy an algebraic constraint. This discovery process exploits both system catalog information and data samples, and employs pruning heuristics to control processing costs. For each candidate, BHUNT constructs algebraic constraints by applying statistical histogramming, segmentation, or clustering techniques to samples of column values. Using results from the theory of tolerance intervals, the sample sizes can be chosen to control the number of "exception" records that fail to satisfy the discovered constraints. In query-optimization mode, BHUNT can automatically partition the data into normal and exception records. During subsequent query processing, queries can be modified to incorporate the constraints; the optimizer uses the constraints to identify new, more efficient access paths. The results are then combined with the results of executing the original query against the (small) set of exception records. Experiments on a very large database using a prototype implementation of BHUNT show reductions in table accesses of up to two orders of magnitude, leading to speedups in query processing by up to a factor of 6.8.
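
The pipeline described in the abstract can be illustrated with a small sketch. This is not the BHUNT implementation: the function names (`min_sample_size`, `bump_intervals`, `split_records`) and the toy data are hypothetical, the gap-based segmentation is a simple stand-in for the histogramming, segmentation, or clustering step, and the sample-size rule is the classical distribution-free tolerance-interval result for a single interval spanned by the sample minimum and maximum, which may differ from the exact bound used in the paper.

```python
def min_sample_size(f, p):
    """Smallest n such that the interval [min, max] of n i.i.d. samples
    covers at least a fraction 1 - f of the population with probability
    at least p (classical distribution-free tolerance interval)."""
    beta = 1.0 - f
    n = 2
    while 1.0 - n * beta ** (n - 1) + (n - 1) * beta ** n < p:
        n += 1
    return n


def bump_intervals(values, max_gap):
    """Segment sampled values of a OP b (here: a - b) into intervals,
    starting a new interval wherever consecutive sorted values differ
    by more than max_gap. Assumption: a stand-in for the paper's
    segmentation step, not its actual algorithm."""
    vs = sorted(values)
    intervals = []
    lo = hi = vs[0]
    for v in vs[1:]:
        if v - hi > max_gap:
            intervals.append((lo, hi))
            lo = v
        hi = v
    intervals.append((lo, hi))
    return intervals


def is_normal(a, b, intervals):
    """True if the pair (a, b) satisfies the fuzzy constraint
    a - b in I1 U I2 U ... U Ik."""
    return any(lo <= a - b <= hi for lo, hi in intervals)


def split_records(pairs, intervals):
    """Partition (a, b) pairs into normal and exception records."""
    normal = [r for r in pairs if is_normal(r[0], r[1], intervals)]
    exceptions = [r for r in pairs if not is_normal(r[0], r[1], intervals)]
    return normal, exceptions


if __name__ == "__main__":
    # Hypothetical (ship_day, order_day) pairs; most differences fall in
    # one of two clusters, and one record is an outlier.
    sample = [(12, 10), (13, 10), (15, 10), (14, 10),
              (22, 10), (24, 10), (23, 10)]
    intervals = bump_intervals([a - b for a, b in sample], max_gap=2)
    normal, exceptions = split_records(sample + [(50, 10)], intervals)
    print(intervals, len(normal), len(exceptions))
    print(min_sample_size(f=0.05, p=0.95))
```

In a query-optimization setting, the intervals would be turned into extra predicates on the rewritten query, while the exception records returned by `split_records` would be kept in a separate small table and queried with the original, unmodified predicate.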