A Parallel Hash Join Algorithm for Managing Data Skew

Authors:
Joel L. Wolf; Philip S. Yu; John Turek; Daniel M. Dias
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
1993

Abstract

Presents a parallel hash join algorithm that is based on the concept of hierarchical hashing, to address the problem of data skew. The proposed algorithm splits the usual hash phase into a hash phase and an explicit transfer phase, and adds an extra scheduling phase between these two. During the scheduling phase, a heuristic optimization algorithm, using the output of the hash phase, attempts to balance the load across the multiple processors in the subsequent join phase. The algorithm naturally identifies the hash partitions with the largest skew values and splits them as necessary, assigning each of them to an optimal number of processors. Assuming for concreteness a Zipf-like distribution of the values in the join column, a join phase which is CPU-bound, and a shared-nothing environment, the algorithm is shown to achieve good join phase load balancing, and to be robust relative to the degree of data skew and the total number of processors. The overall speedup due to this algorithm is compared to some existing parallel hash join methods. The proposed method does considerably better in high skew situations.
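
The scheduling idea described in the abstract can be illustrated with a small sketch. This is not the authors' heuristic: it swaps in a plain greedy longest-processing-time (LPT) assignment, and it assumes that a partition's join work equals its tuple count and that splitting a partition across k processors divides that work evenly with no replication overhead. The function names (`zipf_like_keys`, `hash_phase`, `schedule`) and all parameters are hypothetical and chosen only for illustration.

```python
import heapq
import math
from collections import Counter


def zipf_like_keys(num_values, num_tuples, theta=1.0):
    """Build join-column values whose frequencies follow a Zipf-like law.

    Deterministic on purpose: value v (rank v + 1) appears roughly
    num_tuples * (1 / rank**theta) / normalizer times, at least once.
    """
    weights = [1.0 / (rank ** theta) for rank in range(1, num_values + 1)]
    total = sum(weights)
    keys = []
    for value, w in enumerate(weights):
        keys.extend([value] * max(1, round(num_tuples * w / total)))
    return keys


def hash_phase(keys, num_partitions):
    """Count tuples per hash partition.

    Assumption: these counts stand in for the statistics that the hash
    phase hands to the scheduling phase.
    """
    return Counter(k % num_partitions for k in keys)


def schedule(partition_work, num_procs):
    """Split oversized partitions, then assign the pieces with greedy LPT.

    Assumption: a partition split k ways costs work / k on each processor.
    Returns (assignment, loads), both keyed by processor index.
    """
    target = sum(partition_work.values()) / num_procs
    pieces = []
    for pid, work in partition_work.items():
        # A partition heavier than the ideal per-processor load is split.
        k = min(num_procs, max(1, math.ceil(work / target)))
        pieces.extend((work / k, pid) for _ in range(k))

    # LPT: place the largest remaining piece on the least-loaded processor.
    pieces.sort(reverse=True)
    heap = [(0.0, p) for p in range(num_procs)]
    heapq.heapify(heap)
    assignment = {p: [] for p in range(num_procs)}
    for work, pid in pieces:
        load, p = heapq.heappop(heap)
        assignment[p].append(pid)
        heapq.heappush(heap, (load + work, p))
    loads = {p: load for load, p in heap}
    return assignment, loads


keys = zipf_like_keys(num_values=1000, num_tuples=100_000)
work = hash_phase(keys, num_partitions=64)
assignment, loads = schedule(work, num_procs=8)
print("max load / mean load:", max(loads.values()) / (sum(loads.values()) / 8))
```

Using many more hash partitions than processors gives the scheduler small pieces to fill gaps with, while splitting handles the few partitions dominated by a single hot join value. A real scheduler would also need to account for the cost of replicating the other relation's tuples when a partition is split, which this sketch ignores.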