Parallel ROLAP Data Cube Construction on Shared-Nothing Multiprocessors

Authors:
Ying Chen;Frank Dehne;Todd Eavis;Andrew Rau-Chaplin
Affiliations:
Faculty of Computer Science, Dalhousie University, Halifax, Canada. ychen@cs.dal.ca;School of Computer Science, Carleton University, Ottawa, Canada. frank@dehne.net;Faculty of Computer Science, Dalhousie University, Halifax, Canada. eavis@cs.dal.ca;Faculty of Computer Science, Dalhousie University, Halifax, Canada. arc@cs.dal.ca
Venue:
Distributed and Parallel Databases
Year:
2004

Abstract

The pre-computation of data cubes is critical to improving the response time of On-Line Analytical Processing (OLAP) systems and can be instrumental in accelerating data mining tasks in large data warehouses. To meet the need for improved performance created by growing data sizes, parallel solutions for generating the data cube are becoming increasingly important. This paper presents a parallel method for generating data cubes on a shared-nothing multiprocessor. Since no (expensive) shared disk is required, our method can be used on low-cost Beowulf-style clusters consisting of standard PCs with local disks connected via a data switch. Our approach uses a ROLAP representation of the data cube, where views are stored as relational tables. This allows for tight integration with current relational database technology.

We have implemented our parallel shared-nothing data cube generation method and evaluated it on a PC cluster, exploring relative speedup, local vs. global schedule trees, data skew, cardinality of dimensions, data dimensionality, and balance tradeoffs. For an input data set of 2,000,000 rows (72 Megabytes), our parallel data cube generation method achieves close to optimal speedup, generating a full data cube of ≈227 million rows (5.6 Gigabytes) on a 16-processor cluster in under 6 minutes. For an input data set of 10,000,000 rows (360 Megabytes), our parallel method, running on a 16-processor PC cluster, created a data cube consisting of ≈846 million rows (21.7 Gigabytes) in under 47 minutes.
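
To make the notion of a ROLAP data cube concrete, the following minimal sketch computes every group-by view of a tiny fact table and keeps each view as its own table. It is a sequential illustration only and is not the parallel method of the paper: the function name `full_cube`, the sample dimensions (`store`, `item`, `year`), the measure `sales`, and the choice of SUM as the aggregate are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def full_cube(rows, dims, measure):
    """Return {view: {group_key: sum}} for every subset of dims.

    Each view is a tuple of dimension names; its table maps a tuple of
    dimension values to the summed measure. Hypothetical helper, not
    taken from the paper.
    """
    views = {}
    for k in range(len(dims) + 1):
        for view in combinations(dims, k):
            table = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in view)
                table[key] += row[measure]
            views[view] = dict(table)
    return views


# Made-up fact table with three dimensions and one measure.
rows = [
    {"store": "A", "item": "x", "year": 2003, "sales": 5},
    {"store": "A", "item": "y", "year": 2003, "sales": 3},
    {"store": "B", "item": "x", "year": 2004, "sales": 7},
]

cube = full_cube(rows, ("store", "item", "year"), "sales")

print(len(cube))          # 8 views, i.e. 2**3 subsets of the dimensions
print(cube[("store",)])   # {('A',): 8, ('B',): 7}
print(cube[()])           # {(): 15}, the grand total
```

With d dimensions the full cube has 2^d views, which is why the output can be far larger than the input. This sketch rescans the input for every view; it is meant only to show what the views contain, not how to compute them efficiently.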