Parallelizing the Data Cube

  • Authors:
  • Frank Dehne, Todd Eavis, Susanne Hambrusch, Andrew Rau-Chaplin

  • Affiliations:
  • Frank Dehne: School of Computer Science, Carleton University, Ottawa, Canada K1S 5B6. frank@dehne.net, www.dehne.net
  • Todd Eavis: Faculty of Computer Science, Dalhousie University, Halifax, Canada B3H 1W5. eavis@cs.dal.ca
  • Susanne Hambrusch: Department of Computer Science, Purdue University, West Lafayette, IN 47907, USA. seh@cs.purdue.edu, www.cs.purdue.edu/people/seh
  • Andrew Rau-Chaplin: Faculty of Computer Science, Dalhousie University, Halifax, Canada B3H 1W5. arc@cs.dal.ca, www.cs.dal.ca/~arc

  • Venue:
  • Distributed and Parallel Databases - Special issue: Parallel and distributed data mining
  • Year:
  • 2002

Abstract

This paper presents a general methodology for the efficient parallelization of existing data cube construction algorithms. We describe two partitioning strategies, one for top-down and one for bottom-up cube algorithms. Both strategies assign subcubes to individual processors in such a way that the loads assigned to the processors are balanced. Our methods reduce interprocessor communication overhead by partitioning the load in advance instead of computing each individual group-by in parallel. Our partitioning strategies create a small number of coarse tasks, which allows prefixes and sort orders to be shared between different group-by computations. Our methods enable code reuse by permitting the use of existing sequential (external-memory) data cube algorithms for the subcube computations on each processor. This supports the transfer of optimized sequential data cube code to a parallel setting.

The bottom-up partitioning strategy balances the number of single-attribute external-memory sorts made by each processor. The top-down strategy partitions a weighted tree in which weights reflect algorithm-specific cost measures such as estimated group-by sizes. Both partitioning approaches can be implemented on any shared-disk parallel machine composed of p processors connected via an interconnection fabric with access to a shared parallel disk array.

We have implemented our parallel top-down data cube construction method in C++, with the MPI message-passing library for communication and the LEDA library for the required graph algorithms. We tested our code on an eight-processor cluster, using a variety of data sets with a range of sizes, dimensions, densities, and skews. Comparison tests were performed on a SunFire 6800. The tests show that our partitioning strategies generate a close-to-optimal load balance between processors, and the observed running times show an optimal speedup of p.
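The coarse-task idea in the abstract is easy to make concrete: given per-task cost estimates (for example, estimated group-by sizes), tasks can be pre-assigned to processors so that the total estimated load per processor is roughly balanced. The C++ sketch below illustrates this with a standard longest-processing-time greedy heuristic; the cost values and the function assignTasks are hypothetical names introduced for illustration, and this is not the paper's actual tree-partitioning or sort-balancing algorithm.

#include <algorithm>
#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Greedy longest-processing-time assignment: sort tasks by descending
// estimated cost, then repeatedly give the next task to the currently
// least-loaded processor. Returns owner[t] = processor assigned task t.
// (Illustrative sketch only, not the paper's algorithm.)
std::vector<int> assignTasks(const std::vector<double>& cost, int p) {
    std::vector<int> order(cost.size());
    for (int i = 0; i < static_cast<int>(cost.size()); ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return cost[a] > cost[b]; });

    // Min-heap of (current load, processor id).
    using Load = std::pair<double, int>;
    std::priority_queue<Load, std::vector<Load>, std::greater<Load>> heap;
    for (int q = 0; q < p; ++q) heap.push({0.0, q});

    std::vector<int> owner(cost.size());
    for (int t : order) {
        auto [load, q] = heap.top();  // least-loaded processor so far
        heap.pop();
        owner[t] = q;
        heap.push({load + cost[t], q});
    }
    return owner;
}

int main() {
    // Made-up estimated group-by sizes standing in for task costs.
    std::vector<double> cost = {90, 70, 40, 35, 20, 15, 10, 5};
    std::vector<int> owner = assignTasks(cost, 3);
    for (std::size_t t = 0; t < cost.size(); ++t)
        std::printf("task %zu (cost %.0f) -> processor %d\n",
                    t, cost[t], owner[t]);
    return 0;
}

Under this kind of pre-partitioning, each processor would then run an existing sequential (external-memory) data cube algorithm over its assigned subcubes, which is the code-reuse property the abstract emphasizes.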