Implementing data cube construction using a cluster middleware: algorithms, implementation experience, and performance evaluation

Authors:
Ge Yang; Ruoming Jin; Gagan Agrawal
Affiliations:
Department of Computer and Information Sciences, Ohio State University, Columbus, OH; Department of Computer and Information Sciences, Ohio State University, Columbus, OH; Department of Computer and Information Sciences, Ohio State University, Columbus, OH
Venue:
Future Generation Computer Systems - Selected papers from CCGRID 2002
Year:
2003





Abstract

With increases in the amount of data available for analysis in commercial settings, online analytical processing (OLAP) and decision support have become important applications for high performance computing. Implementing such applications on clusters requires considerable expertise and effort, particularly because of the sizes of the input and output datasets. In this paper, we describe our experiences in developing one such application using a cluster middleware called ADR. We focus on the problem of data cube construction, which commonly arises in multi-dimensional OLAP. We show how ADR, originally developed for scientific data-intensive applications, can be used to implement data cube construction efficiently and scalably. A particular issue with the use of ADR is the tiling of output datasets. We present new algorithms that combine interprocessor communication with tiling within each processor. These algorithms preserve the important properties that are desirable in any parallel data cube construction algorithm. We have carried out a detailed evaluation of our implementation. The main results from our experiments are as follows: (1) high speedups are achieved on both dense and sparse datasets, even though we have used simple algorithms that sequentialize a part of the computation; (2) the execution time depends only upon the amount of computation, and does not increase in a super-linear fashion as the dataset size or the number of tiles increases; and (3) as the datasets become more sparse, sequential performance degrades, but the parallel speedups are still quite good. As part of our ongoing work in this area, we are also looking at handling a larger number of dimensions and multi-dimensional partitionings. We describe our preliminary theoretical and experimental work in this direction.
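
To make the problem statement concrete: data cube construction computes an aggregate for every subset of the dimensions, so a dataset with n dimensions yields 2^n group-bys. The following is a minimal single-process sketch in Python, assuming a sum aggregate and a sparse dataset stored as a dictionary. It is illustrative only and is not the ADR-based parallel algorithm described in the abstract; the function name `build_data_cube` and the dictionary layout are hypothetical choices made for this example. Deriving each group-by from its smallest already computed parent is a common heuristic in the data cube literature, used here as an assumption.

```python
from itertools import combinations


def build_data_cube(cells, num_dims):
    """Compute every group-by of a sparse multidimensional dataset (sum aggregate).

    cells    -- dict mapping a full coordinate tuple to a numeric measure
    num_dims -- number of dimensions in each coordinate tuple

    Returns a dict that maps each tuple of retained dimensions, e.g. (0, 2),
    to a dict from the projected coordinates to the summed measure.
    """
    all_dims = tuple(range(num_dims))
    cube = {all_dims: dict(cells)}

    # Walk the lattice of group-bys from most to fewest dimensions, so that
    # every group-by can be derived from a parent with one more dimension.
    for size in range(num_dims - 1, -1, -1):
        for dims in combinations(all_dims, size):
            # Candidate parents: computed group-bys with exactly one extra dimension.
            parents = [p for p in cube
                       if len(p) == size + 1 and set(dims) <= set(p)]
            # Assumption: pick the parent with the fewest cells to reduce work.
            parent = min(parents, key=lambda p: len(cube[p]))
            keep = [parent.index(d) for d in dims]

            child = {}
            for coords, value in cube[parent].items():
                key = tuple(coords[i] for i in keep)
                child[key] = child.get(key, 0) + value
            cube[dims] = child
    return cube


if __name__ == "__main__":
    # Hypothetical three-dimensional dataset with four non-empty cells.
    data = {(0, 0, 0): 1, (0, 1, 0): 2, (1, 1, 1): 3, (1, 0, 1): 4}
    result = build_data_cube(data, 3)
    for dims in sorted(result, key=len):
        print(dims, result[dims])
```

The sketch keeps everything in memory on one process. The setting in the abstract differs in exactly the respects this sketch ignores: the outputs may be too large to hold at once, which is why they are tiled, and the work is partitioned across processors, which is why interprocessor communication is needed.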