Snakes and sandwiches: optimal clustering strategies for a data warehouse

Authors:
H. V. Jagadish; Laks V. S. Lakshmanan; Divesh Srivastava
Affiliations:
U of Illinois, Urbana-Champaign; IIT, Bombay; AT&T Labs-Research
Venue:
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Year:
1999

Abstract

Physical layout of data is a crucial determinant of performance in a data warehouse. The optimal clustering of data on disk, for minimizing expected I/O, depends on the query workload. In practice, we often have a reasonable sense of the likelihood of different classes of queries, e.g., 40% of the queries concern calls made from some specific telephone number in some month. In this paper, we address the problem of finding an optimal clustering of records of a fact table on disk, given an expected workload in the form of a probability distribution over query classes.

Attributes in a data warehouse fact table typically have hierarchies defined on them (by means of auxiliary dimension tables). The product of the dimensional hierarchy levels forms a lattice and leads to a natural notion of query classes. Optimal clustering in this context is a combinatorially explosive problem with a huge search space (doubly exponential in the number of hierarchy levels). We identify an important subclass of clustering strategies called lattice paths, and present a dynamic programming algorithm for finding the optimal lattice path clustering, in time linear in the lattice size. We additionally propose a technique called snaking, which, when applied to a lattice path, always reduces its cost. For a representative class of star schemas, we show that for every workload, there is a snaked lattice path which is globally optimal. Further, we prove that the clustering obtained by applying snaking to the optimal lattice path is never much worse than the globally optimal snaked lattice path clustering. We complement our analyses and validate the practical utility of our techniques with experiments using TPC-D benchmark data.
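
The general idea behind snaking can be illustrated on a small two-dimensional grid of cells. The sketch below is only a toy, not the paper's algorithm or cost model: it assumes that "snaking" means reversing the traversal direction on alternate rows of a row-major layout, and it uses a made-up measure (the total Manhattan distance between consecutive cells in the linear order) as a stand-in for layout cost. The function names `row_major`, `snaked`, and `total_jump` are hypothetical.

```python
def row_major(rows, cols):
    """Cells of a rows x cols grid in plain row-major order."""
    return [(r, c) for r in range(rows) for c in range(cols)]


def snaked(rows, cols):
    """Same cells, but every odd-numbered row is traversed right to left."""
    order = []
    for r in range(rows):
        # Assumption: snaking = alternate the direction of each row.
        cs = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        order.extend((r, c) for c in cs)
    return order


def total_jump(order):
    """Toy cost: sum of Manhattan distances between consecutive cells."""
    return sum(
        abs(r1 - r2) + abs(c1 - c2)
        for (r1, c1), (r2, c2) in zip(order, order[1:])
    )


if __name__ == "__main__":
    print(total_jump(row_major(3, 4)))  # 17: each row change jumps back across the row
    print(total_jump(snaked(3, 4)))     # 11: every consecutive pair is grid-adjacent
```

Both orders visit exactly the same cells. In the snaked order, cells that are consecutive on disk are always neighbours in the grid, which gives some intuition for why snaking a layout would not increase its cost under a measure of this kind.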