Partitioning is an important step in several database algorithms, including sorting, aggregation, and joins. It is also fundamental to dividing work into equal-sized (balanced) parallel subtasks. In this paper, we aim to find, materialize, and maintain a set of partitioning elements (splitters) for a data set. Unlike traditional partitioning elements, our splitters define both inequality and equality partitions, which allows us to bound the size of the inequality partitions. We provide an algorithm for determining an optimal set of splitters from a sorted data set and show that it has time complexity O(k log² N), where k is the number of splitters requested and N is the size of the data set. We show how the algorithm can be extended to pairs of tables, so that joins can be partitioned into work units with balanced cost. We demonstrate experimentally (a) that the optimal set of splitters can be found efficiently, and (b) that using precomputed splitters can improve the time to sort a data set by up to 76%, with particular benefits in the presence of a few heavy hitters.