New Algorithms for Parallelizing Relational Database Joins in the Presence of Data Skew

Authors:
J. L. Wolf; D. M. Dias; P. S. Yu; J. Turek
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
1994

References (6)

A Parallel Hash Join Algorithm for Managing Data Skew. IEEE Transactions on Parallel and Distributed Systems.
The Art of Computer Programming, Volume 3: Sorting and Searching (2nd ed.).
A Fast Selection Algorithm and the Problem of Optimum Distribution of Effort. Journal of the ACM (JACM).
Effectiveness of Parallel Joins. IEEE Transactions on Knowledge and Data Engineering.
A Parallel Sort Merge Join Algorithm for Managing Data Skew. IEEE Transactions on Parallel and Distributed Systems.
Selectivity Estimation and Query Optimization in Large Databases with Highly Skewed Distribution of Column Values. VLDB '88 Proceedings of the 14th International Conference on Very Large Data Bases.

Cited by (6)

On Disk Allocation of Intermediate Query Results in Parallel Database Systems. Euro-Par '99 Proceedings of the 5th International Euro-Par Conference on Parallel Processing.
Handling data skew in parallel joins in shared-nothing systems. Proceedings of the 2008 ACM SIGMOD international conference on Management of data.
Efficient outer join data skew handling in parallel DBMS. Proceedings of the VLDB Endowment.
An optimal skew-insensitive join and multi-join algorithm for distributed architectures. DEXA'05 Proceedings of the 16th international conference on Database and Expert Systems Applications.
An efficient equi-semi-join algorithm for distributed architectures. ICCS'05 Proceedings of the 5th international conference on Computational Science - Volume Part II.
Adaptive MapReduce using situation-aware mappers. Proceedings of the 15th International Conference on Extending Database Technology.

Abstract

Parallel processing is an attractive option for relational database systems. As in any parallel environment, however, load balancing is a critical issue that affects overall performance. For one common database operation in particular, the join of two relations, load balancing in conventional parallel algorithms can be severely hampered by a natural phenomenon known as data skew. In a pair of recent papers (J. Wolf et al., 1993; 1993), we described two new join algorithms designed to address the data skew problem. Here we propose significant improvements to both algorithms, increasing their effectiveness while simultaneously decreasing their execution times. The paper then focuses on the comparative performance of the improved algorithms and their more conventional counterparts. The new algorithms outperform their conventional counterparts in the presence of almost any skew at all, and dramatically so in cases of high skew.
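
To make the load-balancing problem concrete, the following minimal Python sketch shows how a conventional key-based partitioning step can overload one processor when a join column is skewed. It illustrates the problem only and is not the algorithm proposed in the paper. The function names, the four-processor setup, the synthetic key distributions, and the use of `key % num_processors` as a stand-in for a hash function are all assumptions made for illustration.

```python
def partition_loads(join_keys, num_processors):
    """Count how many tuples a simple modulo partitioner sends to each processor.

    Assumption: integer join keys, with `key % num_processors` standing in
    for the hash function a real system would use.
    """
    loads = [0] * num_processors
    for key in join_keys:
        loads[key % num_processors] += 1
    return loads


def imbalance(loads):
    """Ratio of the heaviest processor's load to the average load (1.0 is ideal)."""
    return max(loads) / (sum(loads) / len(loads))


# Uniform column (synthetic): 1000 tuples, each of 100 key values appears 10 times.
uniform_keys = [k for k in range(100) for _ in range(10)]

# Skewed column (synthetic): 1000 tuples, but key 0 alone accounts for 505 of them.
skewed_keys = [0] * 505 + [k for k in range(1, 100) for _ in range(5)]

uniform_loads = partition_loads(uniform_keys, 4)
skewed_loads = partition_loads(skewed_keys, 4)

print("uniform:", uniform_loads, "imbalance:", imbalance(uniform_loads))
print("skewed: ", skewed_loads, "imbalance:", imbalance(skewed_loads))
```

In this toy setup the uniform column splits evenly, while the skewed column sends every tuple with the frequent key to a single processor. Because a parallel join phase finishes only when its most heavily loaded processor finishes, the imbalance ratio gives a rough sense of how much longer the skewed case would take than a perfectly balanced one.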