Partitioning is an important step in several database algorithms, including sorting, aggregation, and joins. It is also fundamental to dividing work into equal-sized (balanced) parallel subtasks. In this paper, we aim to find, materialize, and maintain a set of partitioning elements (splitters) for a data set. Unlike traditional partitioning elements, our splitters define both inequality and equality partitions, which allows us to bound the size of the inequality partitions. We provide an algorithm for determining an optimal set of splitters from a sorted data set and show that it has time complexity O(k log² N), where k is the number of splitters requested and N is the size of the data set. We show how the algorithm can be extended to pairs of tables, so that joins can be partitioned into work units with balanced cost. We demonstrate experimentally (a) that the optimal set of splitters can be found efficiently, and (b) that using precomputed splitters can improve the time to sort a data set by up to 76%, with particular benefits in the presence of a few heavy hitters.