In this paper, we present HykSort, an optimized comparison sort for distributed memory architectures that attains more than 2× improvement over bitonic sort and samplesort. The algorithm is based on hypercube quicksort, but instead of a binary recursion, we perform a k-way recursion in which the pivots are selected accurately with an iterative parallel select algorithm. The single-node sort is performed using a vectorized and multithreaded merge sort. The advantages of HykSort are lower communication costs, better load balancing, and avoidance of O(p)-collective communication primitives. We also present a staged-communication samplesort, which is more robust than the original samplesort for large core counts. We conduct an experimental study in which we compare hypercube sort, bitonic sort, the original samplesort, the staged samplesort, and HykSort. We report weak and strong scaling results and study the effect of the grain size. We find that no single algorithm performs best, so a hybridization strategy is necessary. As a highlight of our study, in our largest experiment, on 262,144 AMD cores of the Cray XK7 "Titan" platform at the Oak Ridge National Laboratory, we sorted 8 trillion 32-bit integer keys in 37 seconds, achieving 0.9 TB/s effective throughput.
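The k-way recursion described above can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the paper's implementation: the p "ranks" are plain Python lists, the function names `select_splitters` and `kway_sort_sim` are hypothetical, the iterative parallel select is replaced by exact global quantiles, the rank-to-rank exchange pattern is an assumed one, and p is assumed to be divisible by k at every level of the recursion.

```python
from bisect import bisect_right


def select_splitters(parts, k):
    """Return k-1 splitters for the data held by all ranks.

    Assumption: exact global quantiles stand in for the paper's
    iterative parallel select, which is not reproduced here.
    """
    merged = sorted(x for part in parts for x in part)
    n = len(merged)
    return [merged[(j * n) // k] for j in range(1, k)]


def kway_sort_sim(parts, k):
    """Simulate a k-way hypercube-style sort over len(parts) ranks.

    Returns one sorted list per rank; concatenating them in rank
    order gives the globally sorted sequence.
    """
    p = len(parts)
    if p == 1:
        return [sorted(parts[0])]
    if not any(parts):
        return [[] for _ in parts]  # nothing to sort
    k = min(k, p)
    if p % k != 0:
        raise ValueError("this sketch assumes p is divisible by k")

    # Local sort on every rank (the paper uses a vectorized merge sort).
    parts = [sorted(part) for part in parts]
    splitters = select_splitters(parts, k)

    # Split the ranks into k groups; bucket j of every rank goes to group j.
    group_size = p // k
    received = [[] for _ in range(p)]
    for rank, part in enumerate(parts):
        cuts = [0] + [bisect_right(part, s) for s in splitters] + [len(part)]
        for j in range(k):
            # Assumed pairing: same offset within the destination group.
            dest = j * group_size + rank % group_size
            received[dest].extend(part[cuts[j]:cuts[j + 1]])

    # Recurse independently inside each group of p/k ranks.
    result = []
    for j in range(k):
        group = received[j * group_size:(j + 1) * group_size]
        result.extend(kway_sort_sim(group, k))
    return result
```

With p = 16 and k = 4, the simulation finishes in two levels of recursion rather than the four levels a binary (k = 2) hypercube quicksort would need, which is the intuition behind the lower communication cost. Because each rank exchanges data with only k partners per level, no step involves all p ranks at once.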