Optimal chunking of large multidimensional arrays for data warehousing

  • Authors:
  • E. J. Otoo;Doron Rotem;Sridhar Seshadri

  • Affiliations:
  • University of California, Berkeley, CA;University of California, Berkeley, CA;New York University, New York, NY

  • Venue:
  • Proceedings of the ACM tenth international workshop on Data warehousing and OLAP
  • Year:
  • 2007


Abstract

Very large multidimensional arrays are commonly used in data-intensive scientific computations as well as in on-line analytical processing applications, referred to as MOLAP. The storage organization of such arrays on disks is done by partitioning the large global array into fixed-size sub-arrays called chunks or tiles that form the units of data transfer between disk and memory. Typical queries involve the retrieval of sub-arrays in a manner that accesses all chunks that overlap the query results. An important metric of the storage efficiency is the expected number of chunks retrieved over all such queries. The question that immediately arises is "what shapes of array chunks give the minimum expected number of chunks over a query workload?" The problem of optimal chunking was first introduced by Sarawagi and Stonebraker [11], who gave an approximate solution. In this paper we develop exact mathematical models of the problem and provide exact solutions using steepest descent and geometric programming methods. Experimental results, using synthetic and real-life workloads, show that our solutions are consistently within 2.0% of the true number of chunks retrieved for any number of dimensions. In contrast, the approximate solution of [11] can deviate considerably from the true result with an increasing number of dimensions and may also lead to suboptimal chunk shapes.
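To make the cost metric concrete, the following is a minimal sketch (not the paper's exact formulation) of the standard expected-chunks model: along each axis, a range query of length q over chunks of length c whose start offset is uniform over chunk positions touches (q - 1)/c + 1 chunks on average, and the total expected number of chunks retrieved is the product over axes. The function names and the brute-force check are illustrative assumptions, not from the paper.

```python
def expected_chunks(query_shape, chunk_shape):
    """Expected number of chunks a range query overlaps, assuming the
    query's start offset is uniform over chunk-relative positions.
    Per axis: (q - 1) / c + 1; total: product over axes."""
    e = 1.0
    for q, c in zip(query_shape, chunk_shape):
        e *= (q - 1) / c + 1
    return e

def brute_force_1d(q, c):
    """Exact average over all c integer start offsets along one axis,
    used here only to sanity-check the closed form."""
    total = 0
    for s in range(c):
        # chunks spanned by cells [s, s + q): indices s//c .. (s+q-1)//c
        total += (s + q - 1) // c + 1
    return total / c
```

For example, a 7-cell query over 4-cell chunks touches (7 - 1)/4 + 1 = 2.5 chunks on average, which the brute-force enumeration confirms; the chunk-shape optimization the paper studies minimizes this product subject to a fixed chunk volume (the disk-transfer unit).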