An Infrastructure for Scalable Parallel Multidimensional Analysis

  • Authors:
  • Sanjay Goil;Alok Choudhary

  • Venue:
  • SSDBM '99 Proceedings of the 11th International Conference on Scientific and Statistical Database Management
  • Year:
  • 1999

Abstract

Multidimensional analysis in On-Line Analytical Processing (OLAP) and in scientific and statistical databases (SSDB) uses operations that require summary information on multi-dimensional data sets. The most common are aggregate operations along one or more dimensions of numerical data values and/or on hierarchies defined on them. Simultaneous calculation of multi-dimensional aggregates is provided by the Data Cube operator, which is used to calculate and store summary information on a number of dimensions. The cube is computed only partially if the number of dimensions is large, since analysis over summary information typically involves only a few dimensions. Queries may either be answered from a materialized cube or calculated on the fly. The multi-dimensionality of the underlying problem can be represented in both relational and multi-dimensional databases, the latter being a better fit when query performance is the criterion for judgement. Relational databases scale well in size for OLAP and multidimensional analysis, and efforts are under way to make their performance acceptable. Multi-dimensional databases, on the other hand, have been shown to provide good performance for such queries, although they are not very scalable. In this paper we address scalability in multi-dimensional systems for analysis in SSDB and OLAP applications. We describe our system PARSIMONY (Parallel and Scalable Infrastructure for Multidimensional Online analytical processing). Sparsity of data sets is handled by using chunks to store data as a sparse set, represented with a bit-encoded sparse structure. Chunks provide a multi-dimensional index structure for efficient dimension-oriented data access, much as multi-dimensional arrays do. Operations within chunks and between chunks are a combination of relational and multi-dimensional operations, depending on whether the chunk is sparse or dense. Performance results for high-dimensional data sets on a distributed-memory parallel machine (IBM SP-2) show good speedup and scalability.
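
As a rough illustration of two ideas named in the abstract, the sketch below stores one sparse chunk as bit-encoded keys and computes every group-by of a small data cube from it. It is not PARSIMONY's implementation: the class `SparseChunk`, the function `data_cube`, the bit layout, and the use of SUM as the aggregate are all assumptions made for this example, and it runs serially on a single chunk.

```python
from collections import defaultdict
from itertools import combinations


def bits_needed(extent):
    """Bits required to represent offsets 0 .. extent - 1 (at least 1)."""
    return max(1, (extent - 1).bit_length())


class SparseChunk:
    """Hypothetical sparse chunk: maps a bit-encoded offset to a measure."""

    def __init__(self, extents):
        self.extents = tuple(extents)
        self.widths = [bits_needed(e) for e in self.extents]
        # Assumed layout: the last dimension occupies the lowest bits.
        self.shifts = []
        shift = 0
        for width in reversed(self.widths):
            self.shifts.append(shift)
            shift += width
        self.shifts.reverse()
        self.cells = {}  # encoded key -> summed measure

    def encode(self, coords):
        """Pack per-dimension offsets into one integer key."""
        key = 0
        for c, extent, shift in zip(coords, self.extents, self.shifts):
            if not 0 <= c < extent:
                raise IndexError(f"offset {c} outside extent {extent}")
            key |= c << shift
        return key

    def decode(self, key):
        """Unpack an integer key into per-dimension offsets."""
        return tuple(
            (key >> shift) & ((1 << width) - 1)
            for shift, width in zip(self.shifts, self.widths)
        )

    def add(self, coords, value):
        key = self.encode(coords)
        self.cells[key] = self.cells.get(key, 0) + value

    def aggregate(self, keep_dims):
        """SUM over all dimensions not in keep_dims by masking their bits."""
        mask = 0
        for d in keep_dims:
            mask |= ((1 << self.widths[d]) - 1) << self.shifts[d]
        sums = defaultdict(int)
        for key, value in self.cells.items():
            sums[key & mask] += value
        return {
            tuple(self.decode(k)[d] for d in keep_dims): v
            for k, v in sums.items()
        }


def data_cube(chunk):
    """Compute all 2^n group-bys of a chunk, keyed by the kept dimensions."""
    ndim = len(chunk.extents)
    return {
        dims: chunk.aggregate(dims)
        for r in range(ndim + 1)
        for dims in combinations(range(ndim), r)
    }


# Usage: a 4 x 3 x 2 chunk holding three non-empty cells.
chunk = SparseChunk((4, 3, 2))
chunk.add((0, 1, 1), 5)
chunk.add((0, 2, 1), 7)
chunk.add((3, 1, 0), 2)
cube = data_cube(chunk)
```

Projecting onto a subset of dimensions reduces to a bitwise AND on the keys, which is one reason a bit-encoded key is convenient for sparse data. A dense chunk would more naturally use a plain multi-dimensional array, and a partial cube would restrict the loop in `data_cube` to the dimension subsets actually needed.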