An effective algorithm for parallelizing sort merge joins in the presence of data skew

Authors:
Joel L. Wolf;Daniel M. Dias;Philip S. Yu
Affiliations:
IBM Research Division, T. J. Watson Research Center, P.O. Box 704, Yorktown Heights, N.Y. 10598 (all authors)
Venue:
DPDS '90 Proceedings of the second international symposium on Databases in parallel and distributed systems
Year:
1990

Abstract

Parallel processing of relational queries has received considerable attention of late. However, in the presence of data skew, the speedup from conventional parallel join algorithms can be very limited, due to load imbalances among the various processors. Even a single large skew element can cause a processor to become overloaded. In this paper, we propose a parallel sort merge join algorithm which uses a divide-and-conquer approach to address the data skew problem. The proposed algorithm adds an extra scheduling phase to the usual sort, transfer and join phases. During the scheduling phase, a parallelizable optimization algorithm, using the output of the sort phase, attempts to balance the load across the multiple processors in the subsequent join phase. The algorithm naturally identifies the largest skew elements, and assigns each of them to an optimal number of processors. Assuming a Zipf-like distribution for data skew, the algorithm is demonstrated to achieve very good load balancing for the join phase in a CPU-bound environment, and is shown to be very robust relative to the degree of data skew and the total number of processors.
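
The scheduling idea in the abstract can be illustrated with a minimal sketch. This is not the paper's optimization algorithm: it swaps in a plain longest-processing-time-first (LPT) greedy heuristic, and it assumes that the join cost of a key is the product of its frequencies in the two relations and that a skew element can be split evenly across processors with no replication overhead. The function names (`zipf_like_keys`, `schedule_join`, `processor_loads`) and the `theta` parameter are hypothetical, introduced only for this example.

```python
import heapq
import math
from collections import Counter


def zipf_like_keys(n_tuples, n_values, theta=1.0):
    """Build a join column whose i-th most common value has a frequency
    proportional to 1 / i**theta (a Zipf-like skew). Deterministic."""
    weights = [1.0 / (i ** theta) for i in range(1, n_values + 1)]
    total = sum(weights)
    keys = []
    for value, w in enumerate(weights):
        keys.extend([value] * round(n_tuples * w / total))
    return keys


def schedule_join(r_keys, s_keys, n_procs):
    """Assign join work to processors, splitting large skew elements.

    Returns {processor: [(key, cost), ...]}. Cost model (an assumption):
    cost(key) = count in R * count in S.
    """
    r_count, s_count = Counter(r_keys), Counter(s_keys)
    costs = {k: r_count[k] * s_count[k] for k in r_count if k in s_count}
    assignment = {p: [] for p in range(n_procs)}
    total = sum(costs.values())
    if total == 0:
        return assignment

    target = total / n_procs  # ideal per-processor load
    tasks = []
    for key, cost in costs.items():
        # A key costing more than the ideal load is cut into several pieces,
        # so that one skew element is shared by several processors.
        pieces = min(n_procs, max(1, math.ceil(cost / target)))
        tasks.extend((cost / pieces, key) for _ in range(pieces))

    # LPT greedy: largest task first, always onto the least-loaded processor.
    heap = [(0.0, p) for p in range(n_procs)]
    heapq.heapify(heap)
    for cost, key in sorted(tasks, reverse=True):
        load, p = heapq.heappop(heap)
        assignment[p].append((key, cost))
        heapq.heappush(heap, (load + cost, p))
    return assignment


def processor_loads(assignment):
    """Total estimated join cost placed on each processor."""
    return {p: sum(cost for _, cost in work) for p, work in assignment.items()}


if __name__ == "__main__":
    r = zipf_like_keys(1000, 50)
    s = zipf_like_keys(1000, 50)
    for p, load in sorted(processor_loads(schedule_join(r, s, 4)).items()):
        print(f"processor {p}: estimated load {load:.0f}")
```

Without the splitting step, the processor that receives the most frequent key would carry at least that key's entire cost, which is the overload the abstract attributes to a single large skew element. A real implementation would also have to account for the cost of shipping or replicating tuples when one key is shared among processors.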