Detecting attribute dependencies from query feedback

Authors:
Peter J. Haas;Fabian Hueske;Volker Markl
Affiliations:
IBM Almaden Research Center, San Jose, CA;University of Ulm, Ulm, Germany;IBM Almaden Research Center, San Jose, CA
Venue:
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Year:
2007

Citing 17

Approximate computation of multidimensional aggregates of sparse data using wavelets
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data

On Updating Problems in Latent Semantic Indexing
SIAM Journal on Scientific Computing

Independence is good: dependency-based histogram synopses for high-dimensional data
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data

STHoles: a multidimensional workload-aware histogram
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data

Selectivity estimation using probabilistic models
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data

Fast, small-space algorithms for approximate histogram maintenance
STOC '02 Proceedings of the thirty-fourth annual ACM symposium on Theory of computing

Exploiting statistics on query expressions for optimization
Proceedings of the 2002 ACM SIGMOD international conference on Management of data

An intelligent middleware for linear correlation discovery
Decision Support Systems

Automating Statistics Management for Query Optimizers
IEEE Transactions on Knowledge and Data Engineering

LEO - DB2's LEarning Optimizer
Proceedings of the 27th International Conference on Very Large Data Bases

CORDS: automatic discovery of correlations and soft functional dependencies
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data

Exploiting a support-based upper bound of Pearson's correlation coefficient for efficiently identifying strongly correlated pairs
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining

ISOMER: Consistent Histogram Construction Using Query Feedback
ICDE '06 Proceedings of the 22nd International Conference on Data Engineering

SASH: a self-adaptive histogram set for dynamically changing workloads
VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29

BHUNT: automatic discovery of fuzzy algebraic constraints in relational data
VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29

Automated statistics collection in DB2 UDB
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30

Integrating a maximum-entropy cardinality estimator into DB2 UDB
EDBT'06 Proceedings of the 10th international conference on Advances in Database Technology

Cited 4

Rough Sets in Data Warehousing
RSCTC '08 Proceedings of the 6th International Conference on Rough Sets and Current Trends in Computing

ROX: run-time optimization of XQueries
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data

Extending functional dependency to detect abnormal data in RDF graphs
ISWC'11 Proceedings of the 10th international conference on The semantic web - Volume Part I

A greedy algorithm for dimensionality reduction in polynomial regression to forecast the performance of a power plant condenser
Intelligent Data Analysis

Abstract

Real-world datasets exhibit a complex dependency structure among the data attributes. Learning this structure is a key task in automatic statistics configuration for query optimizers, as well as in data mining, metadata discovery, and system management. In this paper, we provide a new method for discovering dependent attribute pairs based on query feedback. Our approach avoids the problem of searching through a combinatorially large space of candidate attribute pairs, automatically focusing system resources on those pairs of demonstrable interest to users. Unlike previous methods, our technique combines all of the pertinent feedback for a specified pair of attributes in a principled and robust manner, while being simple and fast enough to be incorporated into current commercial products. The method is similar in spirit to the CORDS algorithm, which proactively collects frequencies of data values and computes a chi-squared statistic from the resulting contingency table. In the reactive query-feedback setting, many entries of the contingency table are missing, and a key contribution of this paper is a variant of classical chi-squared theory that handles this situation. Because we typically discover a large number of dependent attribute pairs, we provide novel methods for ranking the pairs based on degree of dependency. Such ranking information enables a database system, for example, to stay within the space budget for the system catalog by storing only the currently most important multivariate statistics. Experiments indicate that our dependency rankings are stable even in the presence of relatively few feedback records.
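
As a point of reference for the chi-squared statistic mentioned in the abstract, the following is a minimal sketch of the classical Pearson statistic on a fully observed contingency table, which is the proactive setting the abstract attributes to CORDS. It is not the paper's variant for missing entries, which the abstract does not specify. The function name `chi_squared` and the sample counts are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def chi_squared(table: Sequence[Sequence[float]]) -> float:
    """Classical Pearson chi-squared statistic for a complete contingency table.

    table[i][j] is the observed count of rows with the i-th value of one
    attribute and the j-th value of the other. Assumes every cell is known,
    unlike the query-feedback setting described in the abstract.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    if n == 0:
        raise ValueError("contingency table has no observations")

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two attributes were independent.
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip cells in an all-zero row or column
                stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical counts: the first table is exactly independent,
# the second is perfectly dependent.
independent = [[10, 20], [30, 60]]
dependent = [[50, 0], [0, 50]]

print(chi_squared(independent))  # 0.0
print(chi_squared(dependent))    # 100.0
```

A larger statistic indicates a stronger departure from independence. The raw value grows with the number of observations, so a ranking of attribute pairs would normally require some normalization; the paper's own ranking methods are not reproduced here.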