Parallel processing of relational queries has received considerable attention recently. However, in the presence of data skew, the speedup from conventional parallel join algorithms can be very limited because of load imbalances among the processors. Even a single large skew element can cause a processor to become overloaded. In this paper, we propose a parallel sort-merge join algorithm which uses a divide-and-conquer approach to address the data skew problem. The proposed algorithm adds an extra scheduling phase to the usual sort, transfer, and join phases. During the scheduling phase, a parallelizable optimization algorithm, using the output of the sort phase, attempts to balance the load across the multiple processors in the subsequent join phase. The algorithm naturally identifies the largest skew elements and assigns each of them to an optimal number of processors. Assuming a Zipf-like distribution for data skew, the algorithm is demonstrated to achieve very good load balancing for the join phase in a CPU-bound environment, and is shown to be very robust relative to the degree of data skew and the total number of processors.
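The effect of a scheduling phase on join-phase load can be illustrated with a small simulation. The sketch below is not the paper's optimization algorithm: it assumes a simple cost model (the join work for a key is the product of its tuple counts in the two relations) and substitutes a greedy longest-processing-time heuristic for the scheduler. The function names, the splitting rule, and the parameter values are all hypothetical and chosen only for illustration.

```python
import heapq
import math


def zipf_counts(num_keys, total, theta=1.0):
    """Zipf-like tuple counts: key of rank i gets a share proportional to 1/(i+1)**theta."""
    weights = [1.0 / (rank + 1) ** theta for rank in range(num_keys)]
    norm = sum(weights)
    return [max(1, round(total * w / norm)) for w in weights]


def join_costs(r_counts, s_counts):
    """Assumed CPU cost per key: number of output tuples, |R_k| * |S_k|."""
    return [r * s for r, s in zip(r_counts, s_counts)]


def hash_partition_loads(costs, num_procs):
    """Conventional scheme: every key goes, whole, to processor (key mod P)."""
    loads = [0.0] * num_procs
    for key, cost in enumerate(costs):
        loads[key % num_procs] += cost
    return loads


def skew_aware_loads(costs, num_procs):
    """Illustrative stand-in for a scheduling phase (an assumption, not the paper's method).

    A key whose cost exceeds the ideal per-processor share is split into
    several equal pieces; all pieces are then placed greedily, largest
    first, on the currently least-loaded processor.
    """
    target = sum(costs) / num_procs  # ideal load per processor
    tasks = []
    for cost in costs:
        pieces = min(num_procs, max(1, math.ceil(cost / target)))
        tasks.extend([cost / pieces] * pieces)
    tasks.sort(reverse=True)

    heap = [(0.0, proc) for proc in range(num_procs)]  # (load, processor id)
    heapq.heapify(heap)
    for task in tasks:
        load, proc = heapq.heappop(heap)
        heapq.heappush(heap, (load + task, proc))
    return [load for load, _ in sorted(heap, key=lambda item: item[1])]


def imbalance(loads):
    """Ratio of the heaviest processor's load to the mean load (1.0 is perfect)."""
    return max(loads) / (sum(loads) / len(loads))


if __name__ == "__main__":
    r = zipf_counts(num_keys=1000, total=100_000)
    s = zipf_counts(num_keys=1000, total=100_000)
    costs = join_costs(r, s)
    for name, loads in [("hash", hash_partition_loads(costs, 16)),
                        ("skew-aware", skew_aware_loads(costs, 16))]:
        print(f"{name:>10}: imbalance = {imbalance(loads):.2f}")
```

Under these assumptions the most frequent key alone accounts for more than half of the total join work, so any scheme that keeps each key on a single processor is bounded by that key's cost, whereas splitting the largest keys lets the heaviest processor stay close to the mean. The sketch ignores the cost of replicating the matching tuples of a split key, which a real scheduler would have to account for.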