Effect of skew on join performance in parallel architectures

Authors:
M. Seetha Lakshmi; P. S. Yu
Affiliations:
IBM Research Division, T.J. Watson Research Center, P.O. Box 704, Yorktown Heights, N.Y. (both authors)
Venue:
DPDS '88 Proceedings of the first international symposium on Databases in parallel and distributed systems
Year:
1988

Citing 7
Cited 24

Join processing in database systems with large main memories
ACM Transactions on Database Systems (TODS)

A Performance Comparison of Multimicro and Mainframe Database Architectures
IEEE Transactions on Software Engineering

Join and Semijoin Algorithms for a Multiprocessor Database Machine
ACM Transactions on Database Systems (TODS)

Implementation techniques for main memory database systems
SIGMOD '84 Proceedings of the 1984 ACM SIGMOD international conference on Management of data

Limiting Factors of Join Performance on Parallel Processors
Proceedings of the Fifth International Conference on Data Engineering

Hashing Methods and Relational Algebra Operations
VLDB '84 Proceedings of the 10th International Conference on Very Large Data Bases

GAMMA - A High Performance Dataflow Database Machine
VLDB '86 Proceedings of the 12th International Conference on Very Large Data Bases

Why a single parallelization strategy is not enough in knowledge bases
PODS '89 Proceedings of the eighth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems

Join processing in relational databases
ACM Computing Surveys (CSUR)

Frame-sliced partitioned parallel signature files
SIGIR '92 Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval

Query evaluation techniques for large databases
ACM Computing Surveys (CSUR)

Dynamic Load Balancing in Very Large Shared-Nothing Hypercube Database Computers
IEEE Transactions on Computers

An effective algorithm for parallelizing sort merge joins in the presence of data skew
DPDS '90 Proceedings of the second international symposium on Databases in parallel and distributed systems

Considering data skew factor in multi-way join query optimization for parallel execution
The VLDB Journal — The International Journal on Very Large Data Bases - Parallelism in database systems

Effectiveness of Parallel Joins
IEEE Transactions on Knowledge and Data Engineering

A Graph Theoretical Approach to Determine a Join Reducer Sequence in Distributed Query Processing
IEEE Transactions on Knowledge and Data Engineering

Dynamic Load Balancing in Multicomputer Database Systems Using Partition Tuning
IEEE Transactions on Knowledge and Data Engineering

The Adaptive-Hash Join Algorithm for a Hypercube Multicomputer
IEEE Transactions on Parallel and Distributed Systems

A Virtual Bus Architecture for Dynamic Parallel Processing
IEEE Transactions on Parallel and Distributed Systems

Performance Analysis of Disk Arrays under Failure
VLDB '90 Proceedings of the 16th International Conference on Very Large Data Bases

An Adaptive Data Placement Scheme for Parallel Database Computer Systems
VLDB '90 Proceedings of the 16th International Conference on Very Large Data Bases

Hash-Based Join Algorithms for Multiprocessor Computers
VLDB '90 Proceedings of the 16th International Conference on Very Large Data Bases

Bucket Spreading Parallel Hash: A New, Robust, Parallel Hash Join Method for Data Skew in the Super Database Computer (SDC)
VLDB '90 Proceedings of the 16th International Conference on Very Large Data Bases

Handling Data Skew in Multiprocessor Database Computers Using Partition Tuning
VLDB '91 Proceedings of the 17th International Conference on Very Large Data Bases

A Taxonomy and Performance Model of Data Skew Effects in Parallel Joins
VLDB '91 Proceedings of the 17th International Conference on Very Large Data Bases

Performance Analysis of a Load Balancing Hash-Join Algorithm for a Shared Memory Multiprocessor
VLDB '91 Proceedings of the 17th International Conference on Very Large Data Bases

Estimation of Query-Result Distribution and its Application in Parallel-Join Load Balancing
VLDB '96 Proceedings of the 22nd International Conference on Very Large Data Bases

Query processing in a DBMS for cluster systems
Programming and Computing Software

Query evaluation techniques for cluster database systems
ADBIS'10 Proceedings of the 14th east European conference on Advances in databases and information systems

Introducing skew into the TPC-H benchmark
TPCTC'11 Proceedings of the Third TPC Technology conference on Topics in Performance Evaluation, Measurement and Characterization

Variations of the star schema benchmark to test the effects of data skew on query performance
Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering


Abstract

Skew in the distribution of values taken by an attribute is identified as a major factor that can affect the performance of parallel architectures for relational joins. The effect of skew on the performance of two parallel architectures is evaluated using analytic models. In one architecture, called Database Machine (DBMC), data as well as processing power are distributed, while in the other, called Single Processor Parallel Input/Output (SPPI), data is distributed but the processing power is concentrated in one processor. The two architectures are compared in terms of the ratio of MIPS used by DBMC and SPPI to deliver the same throughput and response time. In addition, the horizontal growth potential of DBMC is evaluated in terms of the maximum speedup achievable by DBMC relative to SPPI response time. Both the MIPS ratio and the speedup are found to be very sensitive to the amount of skew. These results suggest that careful thought should be given to parallelizing database applications and to the design of algorithms and query optimizers for parallel architectures.
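
The sensitivity of speedup to skew can be illustrated with a small toy calculation. The sketch below is not the analytic model used in the paper. It assumes a Zipf-like frequency distribution for the join attribute, a modulo assignment of values to processors as a stand-in for hash partitioning, and a join cost proportional to the number of tuples a processor receives. All function names and parameters are hypothetical and chosen only for illustration.

```python
def zipf_weights(num_values, theta):
    """Relative frequency of each distinct join-attribute value.

    theta = 0 gives a uniform distribution; larger theta gives more skew.
    (Zipf-like weights are an assumption made for this illustration.)
    """
    return [1.0 / (rank ** theta) for rank in range(1, num_values + 1)]


def partition_loads(weights, num_procs):
    """Assign value i to processor i % num_procs and sum the work per processor.

    The modulo assignment is a simple stand-in for a hash function.
    """
    loads = [0.0] * num_procs
    for i, weight in enumerate(weights):
        loads[i % num_procs] += weight
    return loads


def speedup_bound(loads):
    """Total work divided by the work on the busiest processor.

    Under the assumed cost model, the busiest processor determines the
    response time, so this ratio bounds the achievable speedup.
    """
    return sum(loads) / max(loads)


if __name__ == "__main__":
    for theta in (0.0, 0.5, 1.0):
        loads = partition_loads(zipf_weights(800, theta), 8)
        print(f"theta={theta}: speedup bound = {speedup_bound(loads):.2f}")
```

With no skew (theta = 0) and the number of distinct values divisible by the number of processors, the bound equals the processor count. As theta grows, the processor holding the most frequent value becomes the bottleneck and the bound falls below the processor count.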