Parallel Star Join + DataIndexes: Efficient Query Processing in Data Warehouses and OLAP

Authors:
Anindya Datta;Debra VanderMeer;Krithi Ramamritham
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2002



Abstract

On-Line Analytical Processing (OLAP) refers to the technologies that allow users to efficiently retrieve data from the data warehouse for decision-support purposes. Data warehouses tend to be extremely large: it is quite possible for a data warehouse to be hundreds of gigabytes to terabytes in size. Queries tend to be complex and ad hoc, often requiring computationally expensive operations such as joins and aggregation. Given this, we are interested in developing strategies for improving query processing in data warehouses by exploring the applicability of parallel processing techniques. In particular, we exploit the natural partitionability of a star schema and render it even more efficient by applying DataIndexes, a storage structure that serves both as an index and as data and lends itself naturally to vertical partitioning of the data. DataIndexes are derived from the various special-purpose access mechanisms currently supported in commercial OLAP products. Specifically, we propose a declustering strategy that incorporates both task and data partitioning, and we present the Parallel Star Join (PSJ) Algorithm, which provides a means to perform a star join in parallel using efficient operations involving only rowsets and projection columns. We compare the performance of the PSJ Algorithm with two parallel query processing strategies. The first is a parallel join strategy utilizing the Bitmap Join Index (BJI), arguably the state-of-the-art OLAP join structure in use today. For the second strategy, we choose a well-known parallel join algorithm, namely the pipelined hash algorithm. To assist in the performance comparison, we first develop a cost model of the disk access and transmission costs for all three approaches.
Performance comparisons show that the DataIndex-based approach leads to dramatically lower disk access costs than both the BJI and the hybrid hash approaches in both speedup and scaleup experiments, while the hash-based approach outperforms the BJI in disk access costs. With regard to transmission overhead, our performance results show that PSJ and BJI outperform the hash-based approach. Overall, our parallel star join algorithm and DataIndexes form a winning combination.
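
The abstract describes a star join built from rowsets and projection columns over vertically partitioned data. The following is a minimal illustrative sketch of that general idea, not the paper's PSJ Algorithm or its DataIndex layout. It assumes a toy star schema held as one Python list per column, with fact-table foreign keys stored as row positions in the dimension tables. All table names, column names, and helper functions are hypothetical, and threads stand in for the processors that a real declustered system would use.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy star schema stored column-wise: one list per column, list position = row id.
# Every name below is hypothetical and chosen only for illustration.
store_region = ["east", "west", "east"]        # dimension column
product_category = ["toys", "books"]           # dimension column
sales_store_row = [0, 1, 2, 0, 1, 2]           # fact column: row id into the store dimension
sales_product_row = [0, 0, 0, 1, 0, 1]         # fact column: row id into the product dimension
sales_amount = [10, 20, 30, 40, 50, 60]        # fact measure (projection) column


def restrict(column, predicate):
    """Return the rowset (set of row ids) whose value satisfies the predicate."""
    return {rid for rid, value in enumerate(column) if predicate(value)}


def matching_fact_rows(fk_column, dim_rowset):
    """Return the fact row ids whose foreign-key entry falls in dim_rowset."""
    return {rid for rid, dim_rid in enumerate(fk_column) if dim_rid in dim_rowset}


def star_join_sum(dim_filters, measure):
    """Sum `measure` over the fact rows that satisfy every dimension filter."""
    def one_dimension(task):
        dim_column, predicate, fk_column = task
        return matching_fact_rows(fk_column, restrict(dim_column, predicate))

    # Each dimension is processed as an independent task (task partitioning).
    with ThreadPoolExecutor() as pool:
        rowsets = list(pool.map(one_dimension, dim_filters))

    # The join result is the intersection of the per-dimension fact rowsets.
    selected = set.intersection(*rowsets)
    # Only the needed projection column is read for the selected rows.
    return sum(measure[rid] for rid in sorted(selected))


total = star_join_sum(
    [
        (store_region, lambda v: v == "east", sales_store_row),
        (product_category, lambda v: v == "toys", sales_product_row),
    ],
    sales_amount,
)
print(total)
```

In this sketch, each dimension restriction touches only the columns it needs, and the final aggregation reads a single measure column for the surviving row ids. That is the intuition behind working with rowsets and projection columns instead of full rows.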