Handling data skew in parallel joins in shared-nothing systems

Authors:
Yu Xu;Pekka Kostamaa;Xin Zhou;Liang Chen
Affiliations:
Teradata, San Diego, CA, USA;Teradata, San Diego, CA, USA;Teradata, San Diego, CA, USA;UCSD, San Diego, CA, USA
Venue:
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Year:
2008

Works cited by this paper (14):

Parallel database systems: the future of high performance database systems. Communications of the ACM.
Using shared virtual memory for parallel join processing. SIGMOD '93 Proceedings of the 1993 ACM SIGMOD international conference on Management of data.
Predictive dynamic load balancing of parallel hash-joins over heterogeneous processors in the presence of data skew. PDIS '94 Proceedings of the third international conference on Parallel and distributed information systems.
Effectiveness of Parallel Joins. IEEE Transactions on Knowledge and Data Engineering.
New Algorithms for Parallelizing Relational Database Joins in the Presence of Data Skew. IEEE Transactions on Knowledge and Data Engineering.
A Parallel Sort Merge Join Algorithm for Managing Data Skew. IEEE Transactions on Parallel and Distributed Systems.
Frequency-adaptive join for shared nothing machines. Progress in computer research.
An Effective Algorithm for Parallelizing Hash Joins in the Presence of Data Skew. Proceedings of the Seventh International Conference on Data Engineering.
Bucket Spreading Parallel Hash: A New, Robust, Parallel Hash Join Method for Data Skew in the Super Database Computer (SDC). VLDB '90 Proceedings of the 16th International Conference on Very Large Data Bases.
Handling Data Skew in Multiprocessor Database Computers Using Partition Tuning. VLDB '91 Proceedings of the 17th International Conference on Very Large Data Bases.
A Taxonomy and Performance Model of Data Skew Effects in Parallel Joins. VLDB '91 Proceedings of the 17th International Conference on Very Large Data Bases.
Practical Skew Handling in Parallel Joins. VLDB '92 Proceedings of the 18th International Conference on Very Large Data Bases.
Dynamic Join Product Skew Handling for Hash-Joins in Shared-Nothing Database Systems. Proceedings of the 4th International Conference on Database Systems for Advanced Applications (DASFAA).
Skew-Insensitive Parallel Algorithms for Relational Join. HIPC '98 Proceedings of the Fifth International Conference on High Performance Computing.

Works citing this paper (10):

Efficient outer join data skew handling in parallel DBMS. Proceedings of the VLDB Endowment.
Skew-resistant parallel processing of feature-extracting scientific user-defined functions. Proceedings of the 1st ACM symposium on Cloud computing.
Query processing in a DBMS for cluster systems. Programming and Computing Software.
Query evaluation techniques for cluster database systems. ADBIS'10 Proceedings of the 14th east European conference on Advances in databases and information systems.
Parallel evaluation of conjunctive queries. Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems.
A new framework for join product skew. RED'10 Proceedings of the Third international conference on Resource Discovery.
Worst-case optimal join algorithms: [extended abstract]. PODS '12 Proceedings of the 31st symposium on Principles of Database Systems.
Adaptive and big data scale parallel execution in Oracle. Proceedings of the VLDB Endowment.
Skew strikes back: new developments in the theory of join algorithms. ACM SIGMOD Record.
Balancing reducer workload for skewed data using sampling-based partitioning. Computers and Electrical Engineering.

Abstract

Parallel processing continues to be important in large data warehouses, where processing requirements keep expanding in multiple dimensions: greater data volumes, an increasing number of concurrent users, more complex queries, and more applications that define complex logical, semantic, and physical data models. Shared-nothing parallel database management systems [16] can scale up "horizontally" by adding more nodes. Most parallel algorithms, however, do not take data skew into account, even though data skew occurs naturally in many applications. A query that processes skewed data not only suffers a slower response time but also generates hot nodes, which become a bottleneck that throttles overall system performance. Motivated by real business problems, we propose a new join geography called PRPD (Partial Redistribution & Partial Duplication) to improve the performance and scalability of parallel joins in the presence of data skew in a shared-nothing system. Our experimental results show that PRPD significantly reduces query elapsed time in the presence of data skew. Our experience shows that eliminating system bottlenecks caused by data skew improves the throughput of the whole system, which is important in parallel data warehouses that often run high-concurrency workloads.
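
The abstract names PRPD but does not describe its mechanics, so the following is only a minimal sketch of one plausible reading of "partial redistribution and partial duplication", not the authors' implementation. It assumes an equi-join of R and S in which a few key values are heavily skewed in R: R rows carrying a skewed key stay on the node where they already reside, S rows carrying a skewed key are duplicated to every node, and all remaining rows are hash-redistributed as usual. The function names, the frequency threshold, the use of `key % n` as the hash function, and the toy data are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict

def local_join(r_rows, s_rows):
    """Join two lists of (key, payload) rows on one node with a hash table."""
    index = defaultdict(list)
    for key, s_payload in s_rows:
        index[key].append(s_payload)
    return [(k, rp, sp) for k, rp in r_rows for sp in index.get(k, [])]

def redistribute(nodes, n):
    """Conventional plan: send every row to node (key % n)."""
    out = [[] for _ in range(n)]
    for rows in nodes:
        for row in rows:
            out[row[0] % n].append(row)
    return out

def plain_join(r_nodes, s_nodes, n):
    """Hash-redistribute both relations, then join locally on each node."""
    r, s = redistribute(r_nodes, n), redistribute(s_nodes, n)
    return [local_join(r[i], s[i]) for i in range(n)], [len(x) for x in r]

def detect_skewed(r_nodes, threshold):
    """Hypothetical skew test: keys occurring more than `threshold` times in R."""
    counts = Counter(k for rows in r_nodes for k, _ in rows)
    return {k for k, c in counts.items() if c > threshold}

def prpd_join(r_nodes, s_nodes, skewed, n):
    """Assumed PRPD-style plan: keep skewed R rows local, duplicate matching S rows."""
    r_local = [[] for _ in range(n)]   # skewed R rows, left where they are
    r_redis = [[] for _ in range(n)]   # non-skewed R rows, hash-redistributed
    s_dup = [[] for _ in range(n)]     # S rows with skewed keys, copied to all nodes
    s_redis = [[] for _ in range(n)]   # remaining S rows, hash-redistributed
    for i, rows in enumerate(r_nodes):
        for row in rows:
            (r_local[i] if row[0] in skewed else r_redis[row[0] % n]).append(row)
    for rows in s_nodes:
        for row in rows:
            if row[0] in skewed:
                for j in range(n):
                    s_dup[j].append(row)
            else:
                s_redis[row[0] % n].append(row)
    out = [local_join(r_local[i], s_dup[i]) + local_join(r_redis[i], s_redis[i])
           for i in range(n)]
    load = [len(r_local[i]) + len(r_redis[i]) for i in range(n)]
    return out, load

def build_demo(n=4):
    """Toy data: 700 of the 1000 R rows share key 0; rows start out round-robin."""
    r = [(0, f"r{i}") for i in range(700)] + [(k, f"r{699 + k}") for k in range(1, 301)]
    s = [(k, f"s{k}") for k in range(301)]
    return [r[i::n] for i in range(n)], [s[i::n] for i in range(n)]

if __name__ == "__main__":
    n = 4
    r_nodes, s_nodes = build_demo(n)
    skewed = detect_skewed(r_nodes, threshold=100)
    _, plain_load = plain_join(r_nodes, s_nodes, n)
    _, prpd_load = prpd_join(r_nodes, s_nodes, skewed, n)
    print("R rows per node, plain redistribution:", plain_load)
    print("R rows per node, PRPD-style plan:     ", prpd_load)
```

On this toy input, plain redistribution sends 775 of the 1000 R rows to a single node (the hot node the abstract describes), while the PRPD-style plan leaves each of the four nodes with 250 R rows and produces the same join result. The price is that the single S row with the skewed key is copied to every node, which is cheap only when few S rows carry skewed keys.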