Correlation maps: a compressed access method for exploiting soft functional dependencies

Authors:
Hideaki Kimura;George Huo;Alexander Rasin;Samuel Madden;Stanley B. Zdonik
Affiliations:
Brown University;Google, Inc.;Brown University;MIT CSAIL;Brown University
Venue:
Proceedings of the VLDB Endowment
Year:
2009

Citing 13
Cited 6

A system for semantic query optimization
SIGMOD '87 Proceedings of the 1987 ACM SIGMOD international conference on Management of data

Prefix B-trees
ACM Transactions on Database Systems (TODS)

Towards estimation error guarantees for distinct values
PODS '00 Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

Exploiting constraint-like data characterizations in query optimization
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data

Discovery and Application of Check Constraints in DB2
Proceedings of the 17th International Conference on Data Engineering

Implementation of Two Semantic Query Optimization Techniques in DB2 Universal Database
VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases

Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports
Proceedings of the 27th International Conference on Very Large Data Bases

CORDS: automatic discovery of correlations and soft functional dependencies
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data

Approximate encoding for direct access and query processing over compressed bitmaps
VLDB '06 Proceedings of the 32nd international conference on Very large data bases

QUIST: a system for semantic query optimization in relational databases
VLDB '81 Proceedings of the seventh international conference on Very Large Data Bases - Volume 7

Knowledge-based query processing
VLDB '80 Proceedings of the sixth international conference on Very Large Data Bases - Volume 6

BHUNT: automatic discovery of Fuzzy algebraic constraints in relational data
VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29

Adjoined Dimension Column Clustering to Improve Data Warehouse Query Performance
ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering

UPI: a primary index for uncertain databases
Proceedings of the VLDB Endowment

CORADD: correlation aware database designer for materialized views and indexes
Proceedings of the VLDB Endowment

Predicting cost amortization for query services
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data

Differential dependencies: Reasoning and discovery
ACM Transactions on Database Systems (TODS)

Design by example for SQL table definitions with functional dependencies
The VLDB Journal — The International Journal on Very Large Data Bases

Optimizing index deployment order for evolving OLAP
Proceedings of the 15th International Conference on Extending Database Technology

Abstract

In relational query processing, there are generally two choices of access path when performing a predicate lookup for which no clustered index is available. One option is to use an unclustered index. Another is to perform a complete sequential scan of the table. Many analytical workloads do not benefit from unclustered indexes; the cost of random disk I/O becomes prohibitive for all but the most selective queries. It has been observed that a secondary index on an unclustered attribute can perform well under certain conditions if the unclustered attribute is correlated with a clustered index attribute [4]. The clustered index co-locates values, and the correlation localizes access through the unclustered attribute to a subset of the pages. In this paper, we show that in a real application (SDSS) and a widely used benchmark (TPC-H), there exist many cases of attribute correlation that can be exploited to accelerate queries. We also discuss a tool that can automatically suggest useful pairs of correlated attributes. It does so using an analytical cost model that we developed, which is novel in its awareness of the effects of clustering and correlation. Furthermore, we propose a data structure called a Correlation Map (CM) that expresses the mapping between the correlated attributes, acting much like a secondary index. The paper also discusses how bucketing the domains of both attributes in the correlated attribute pair can dramatically reduce the size of the CM, potentially making it orders of magnitude smaller than a secondary B+Tree index. This reduction in size allows us to create a large number of CMs that improve performance for a wide range of queries. The small size also reduces maintenance costs, as we demonstrate experimentally.
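
The idea of a bucketed Correlation Map can be illustrated with a small sketch. This is not the paper's implementation: it assumes integer attributes and fixed-width buckets, and the names `CorrelationMap`, `u_width`, and `c_width` are hypothetical, chosen only for this example. Each bucket of the unclustered attribute maps to the set of clustered-attribute buckets in which it co-occurs; a lookup returns clustered ranges to scan, and the caller re-checks the original predicate on the rows it reads.

```python
class CorrelationMap:
    """Toy correlation map: unclustered-attribute bucket -> set of
    clustered-attribute buckets. Fixed-width integer bucketing is an
    assumption of this sketch, not a detail taken from the paper."""

    def __init__(self, u_width, c_width):
        self.u_width = u_width  # bucket width on the unclustered attribute
        self.c_width = c_width  # bucket width on the clustered attribute
        self.buckets = {}       # u bucket id -> set of c bucket ids

    def insert(self, u_value, c_value):
        u_bucket = u_value // self.u_width
        c_bucket = c_value // self.c_width
        self.buckets.setdefault(u_bucket, set()).add(c_bucket)

    def lookup(self, u_value):
        """Return clustered-attribute ranges [lo, hi) that may hold matches."""
        c_buckets = sorted(self.buckets.get(u_value // self.u_width, ()))
        return [(b * self.c_width, (b + 1) * self.c_width) for b in c_buckets]


# Synthetic table clustered on c, with u = c // 10 strongly correlated with c.
rows = [(c, c // 10) for c in range(1000)]

cm = CorrelationMap(u_width=5, c_width=100)
for c, u in rows:
    cm.insert(u, c)

# Predicate u = 42: the map narrows the scan to one clustered range.
ranges = cm.lookup(42)
print(ranges)  # [(400, 500)]

# Scan only those ranges and re-apply the predicate, because bucketing
# admits false positives within a range.
matches = [(c, u) for c, u in rows
           if any(lo <= c < hi for lo, hi in ranges) and u == 42]
print(len(matches))      # 10
print(len(cm.buckets))   # 20 map entries for 1000 rows
```

Wider buckets shrink the map further but return larger clustered ranges, so more rows are read and discarded; the sketch exposes that trade-off through `u_width` and `c_width`.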