Parallelizing the Data Cube

  • Authors:
  • Frank Dehne, Todd Eavis, Susanne Hambrusch, Andrew Rau-Chaplin

  • Affiliations:
  • Frank Dehne: School of Computer Science, Carleton University, Ottawa, Canada K1S 5B6. frank@dehne.net, www.dehne.net
  • Todd Eavis: Faculty of Computer Science, Dalhousie University, Halifax, Canada B3H 1W5. eavis@cs.dal.ca
  • Susanne Hambrusch: Department of Computer Science, Purdue University, West Lafayette, IN 47907, USA. seh@cs.purdue.edu, www.cs.purdue.edu/people/seh
  • Andrew Rau-Chaplin: Faculty of Computer Science, Dalhousie University, Halifax, Canada B3H 1W5. arc@cs.dal.ca, www.cs.dal.ca/~arc

  • Venue:
  • Distributed and Parallel Databases - Special issue: Parallel and distributed data mining
  • Year:
  • 2002

Abstract

This paper presents a general methodology for the efficient parallelization of existing data cube construction algorithms. We describe two partitioning strategies, one for top-down and one for bottom-up cube algorithms. Both strategies assign subcubes to individual processors in such a way that the loads assigned to the processors are balanced. Our methods reduce interprocessor communication overhead by partitioning the load in advance instead of computing each individual group-by in parallel. Our partitioning strategies create a small number of coarse tasks, which allows prefixes and sort orders to be shared between different group-by computations. Our methods enable code reuse by permitting the use of existing sequential (external-memory) data cube algorithms for the subcube computations on each processor. This supports the transfer of optimized sequential data cube code to a parallel setting.

The bottom-up partitioning strategy balances the number of single-attribute external-memory sorts made by each processor. The top-down strategy partitions a weighted tree in which weights reflect algorithm-specific cost measures such as estimated group-by sizes. Both partitioning approaches can be implemented on any shared-disk parallel machine composed of p processors connected via an interconnection fabric with access to a shared parallel disk array.

We have implemented our parallel top-down data cube construction method in C++, with the MPI message-passing library for communication and the LEDA library for the required graph algorithms. We tested our code on an eight-processor cluster, using a variety of data sets with a range of sizes, dimensions, densities, and skews. Comparison tests were performed on a SunFire 6800. The tests show that our partitioning strategies generate a close-to-optimal load balance between processors, and the observed running times show an optimal speedup of p.
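The coarse-task idea in the abstract is easy to make concrete: given per-task cost estimates (for example, estimated group-by sizes), tasks can be pre-assigned to processors so that the total estimated load per processor is roughly balanced. The C++ sketch below illustrates this with a standard longest-processing-time greedy heuristic; the cost values and the function assignTasks are hypothetical names introduced for illustration, and this is not the paper's actual tree-partitioning or sort-balancing algorithm.

#include <algorithm>
#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Greedy longest-processing-time assignment: sort tasks by descending
// estimated cost, then repeatedly give the next task to the currently
// least-loaded processor. Returns owner[t] = processor assigned task t.
// (Illustrative sketch only, not the paper's algorithm.)
std::vector<int> assignTasks(const std::vector<double>& cost, int p) {
    std::vector<int> order(cost.size());
    for (int i = 0; i < static_cast<int>(cost.size()); ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return cost[a] > cost[b]; });

    // Min-heap of (current load, processor id).
    using Load = std::pair<double, int>;
    std::priority_queue<Load, std::vector<Load>, std::greater<Load>> heap;
    for (int q = 0; q < p; ++q) heap.push({0.0, q});

    std::vector<int> owner(cost.size());
    for (int t : order) {
        auto [load, q] = heap.top();  // least-loaded processor so far
        heap.pop();
        owner[t] = q;
        heap.push({load + cost[t], q});
    }
    return owner;
}

int main() {
    // Made-up estimated group-by sizes standing in for task costs.
    std::vector<double> cost = {90, 70, 40, 35, 20, 15, 10, 5};
    std::vector<int> owner = assignTasks(cost, 3);
    for (std::size_t t = 0; t < cost.size(); ++t)
        std::printf("task %zu (cost %.0f) -> processor %d\n",
                    t, cost[t], owner[t]);
    return 0;
}

Under this kind of pre-partitioning, each processor would then run an existing sequential (external-memory) data cube algorithm over its assigned subcubes, which is the code-reuse property the abstract emphasizes.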