Optimal chunking of large multidimensional arrays for data warehousing

  • Authors:
  • E. J. Otoo;Doron Rotem;Sridhar Seshadri

  • Affiliations:
  • University of California, Berkeley, CA;University of California, Berkeley, CA;New York University, New York, NY

  • Venue:
  • Proceedings of the ACM tenth international workshop on Data warehousing and OLAP
  • Year:
  • 2007


Abstract

Very large multidimensional arrays are commonly used in data-intensive scientific computations as well as in on-line analytical processing applications, referred to as MOLAP. The storage organization of such arrays on disks is done by partitioning the large global array into fixed-size sub-arrays called chunks or tiles that form the units of data transfer between disk and memory. Typical queries involve the retrieval of sub-arrays in a manner that accesses all chunks that overlap the query results. An important metric of the storage efficiency is the expected number of chunks retrieved over all such queries. The question that immediately arises is "what shapes of array chunks give the minimum expected number of chunks over a query workload?" The problem of optimal chunking was first introduced by Sarawagi and Stonebraker [11], who gave an approximate solution. In this paper we develop exact mathematical models of the problem and provide exact solutions using steepest descent and geometric programming methods. Experimental results, using synthetic and real-life workloads, show that our solutions are consistently within 2.0% of the true number of chunks retrieved for any number of dimensions. In contrast, the approximate solution of [11] can deviate considerably from the true result with an increasing number of dimensions and may also lead to suboptimal chunk shapes.
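To make the cost metric concrete, the following is a minimal sketch (not the paper's exact formulation) of the standard expected-chunks model: along each axis, a range query of length q over chunks of length c whose start offset is uniform over chunk positions touches (q - 1)/c + 1 chunks on average, and the total expected number of chunks retrieved is the product over axes. The function names and the brute-force check are illustrative assumptions, not from the paper.

```python
def expected_chunks(query_shape, chunk_shape):
    """Expected number of chunks a range query overlaps, assuming the
    query's start offset is uniform over chunk-relative positions.
    Per axis: (q - 1) / c + 1; total: product over axes."""
    e = 1.0
    for q, c in zip(query_shape, chunk_shape):
        e *= (q - 1) / c + 1
    return e

def brute_force_1d(q, c):
    """Exact average over all c integer start offsets along one axis,
    used here only to sanity-check the closed form."""
    total = 0
    for s in range(c):
        # chunks spanned by cells [s, s + q): indices s//c .. (s+q-1)//c
        total += (s + q - 1) // c + 1
    return total / c
```

For example, a 7-cell query over 4-cell chunks touches (7 - 1)/4 + 1 = 2.5 chunks on average, which the brute-force enumeration confirms; the chunk-shape optimization the paper studies minimizes this product subject to a fixed chunk volume (the disk-transfer unit).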