Parallel database systems: the future of high performance database systems
Communications of the ACM
Using shared virtual memory for parallel join processing
SIGMOD '93 Proceedings of the 1993 ACM SIGMOD international conference on Management of data
PDIS '94 Proceedings of the third international conference on Parallel and distributed information systems
Effectiveness of Parallel Joins
IEEE Transactions on Knowledge and Data Engineering
New Algorithms for Parallelizing Relational Database Joins in the Presence of Data Skew
IEEE Transactions on Knowledge and Data Engineering
A Parallel Sort Merge Join Algorithm for Managing Data Skew
IEEE Transactions on Parallel and Distributed Systems
Frequency-adaptive join for shared nothing machines
Progress in computer research
An Effective Algorithm for Parallelizing Hash Joins in the Presence of Data Skew
Proceedings of the Seventh International Conference on Data Engineering
VLDB '90 Proceedings of the 16th International Conference on Very Large Data Bases
Handling Data Skew in Multiprocessor Database Computers Using Partition Tuning
VLDB '91 Proceedings of the 17th International Conference on Very Large Data Bases
A Taxonomy and Performance Model of Data Skew Effects in Parallel Joins
VLDB '91 Proceedings of the 17th International Conference on Very Large Data Bases
Practical Skew Handling in Parallel Joins
VLDB '92 Proceedings of the 18th International Conference on Very Large Data Bases
Dynamic Join Product Skew Handling for Hash-Joins in Shared-Nothing Database Systems
Proceedings of the 4th International Conference on Database Systems for Advanced Applications (DASFAA)
Skew-Insensitive Parallel Algorithms for Relational Join
HIPC '98 Proceedings of the Fifth International Conference on High Performance Computing
Efficient outer join data skew handling in parallel DBMS
Proceedings of the VLDB Endowment
Skew-resistant parallel processing of feature-extracting scientific user-defined functions
Proceedings of the 1st ACM symposium on Cloud computing
Query processing in a DBMS for cluster systems
Programming and Computing Software
Query evaluation techniques for cluster database systems
ADBIS'10 Proceedings of the 14th east European conference on Advances in databases and information systems
Parallel evaluation of conjunctive queries
Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
A new framework for join product skew
RED'10 Proceedings of the Third international conference on Resource Discovery
Worst-case optimal join algorithms: [extended abstract]
PODS '12 Proceedings of the 31st symposium on Principles of Database Systems
Adaptive and big data scale parallel execution in Oracle
Proceedings of the VLDB Endowment
Skew strikes back: new developments in the theory of join algorithms
ACM SIGMOD Record
Balancing reducer workload for skewed data using sampling-based partitioning
Computers and Electrical Engineering
Parallel processing continues to be important in large data warehouses, where processing requirements keep expanding along multiple dimensions: greater data volumes, an increasing number of concurrent users, more complex queries, and more applications that define complex logical, semantic, and physical data models. Shared-nothing parallel database management systems [16] can scale up "horizontally" by adding more nodes. Most parallel algorithms, however, do not take data skew into account, even though data skew occurs naturally in many applications. A query that processes skewed data not only has a slower response time but also creates hot nodes, which become a bottleneck that throttles overall system performance. Motivated by real business problems, we propose a new join geography called PRPD (Partial Redistribution & Partial Duplication) to improve the performance and scalability of parallel joins in the presence of data skew in a shared-nothing system. Our experimental results show that PRPD significantly reduces query elapsed time in the presence of data skew. Our experience shows that eliminating system bottlenecks caused by data skew improves the throughput of the whole system, which is important in parallel data warehouses that often run high-concurrency workloads.
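The abstract does not spell out how PRPD works, so the following is only a minimal sketch of one plausible reading of the name, not the authors' implementation. It assumes that rows of the large relation with skewed join keys stay on the node where they already reside, that the matching rows of the other relation are duplicated to every node, and that all remaining rows are hash-redistributed as usual. The function names, the modulo "hash", and the toy data are all hypothetical.

```python
from collections import Counter

NUM_NODES = 4
SKEWED_KEYS = {1000}  # assumption: the skewed join-key values are known in advance


def node_of(key, num_nodes=NUM_NODES):
    # Deterministic stand-in for a real hash function (assumes integer keys).
    return key % num_nodes


def place_large_side(rows_by_node, skewed_keys):
    """Keep skewed-key rows on their current node; hash-redistribute the rest."""
    placed = [[] for _ in range(NUM_NODES)]
    for node, rows in enumerate(rows_by_node):
        for key in rows:
            target = node if key in skewed_keys else node_of(key)
            placed[target].append(key)
    return placed


def place_other_side(rows, skewed_keys):
    """Duplicate skewed-key rows to every node; hash-redistribute the rest."""
    placed = [[] for _ in range(NUM_NODES)]
    for key in rows:
        targets = range(NUM_NODES) if key in skewed_keys else [node_of(key)]
        for target in targets:
            placed[target].append(key)
    return placed


def local_join_count(r_rows, s_rows):
    """Number of matching (r, s) pairs on a single node."""
    s_counts = Counter(s_rows)
    return sum(s_counts[key] for key in r_rows)


# Toy data: every node starts with 100 rows of the skewed key plus 8 distinct keys.
r_by_node = [[1000] * 100 + list(range(n * 8, n * 8 + 8)) for n in range(NUM_NODES)]
s_rows = list(range(32)) + [1000]

# Plain hash redistribution: an empty skew set sends every row to its hash node.
hash_loads = [len(rows) for rows in place_large_side(r_by_node, set())]

# Partial redistribution of R, partial duplication of S.
r_placed = place_large_side(r_by_node, SKEWED_KEYS)
s_placed = place_other_side(s_rows, SKEWED_KEYS)
prpd_loads = [len(rows) for rows in r_placed]
prpd_join_total = sum(local_join_count(r, s) for r, s in zip(r_placed, s_placed))

print("hash redistribution loads:", hash_loads)
print("partial redistribution loads:", prpd_loads)
print("join result rows:", prpd_join_total)
```

With plain hash redistribution, every row carrying the skewed key lands on a single node, which is the hot-node effect described above. Under the assumed scheme the large relation's load is even across nodes, at the cost of copying one small-side row to each node, and the local joins still produce every matching pair.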