HykSort: a new variant of hypercube quicksort on distributed memory architectures

Authors:
Hari Sundar;Dhairya Malhotra;George Biros
Affiliations:
University of Texas at Austin, Austin, TX, USA;University of Texas at Austin, Austin, TX, USA;University of Texas at Austin, Austin, TX, USA
Venue:
Proceedings of the 27th international ACM conference on International conference on supercomputing
Year:
2013



Abstract

In this paper, we present HykSort, an optimized comparison sort for distributed memory architectures that attains more than 2× improvement over bitonic sort and samplesort. The algorithm is based on the hypercube quicksort, but instead of a binary recursion, we perform a k-way recursion in which the pivots are selected accurately with an iterative parallel select algorithm. The single-node sort is performed using a vectorized and multithreaded merge sort. The advantages of HykSort are lower communication costs, better load balancing, and avoidance of O(p)-collective communication primitives. We also present a staged-communication samplesort, which is more robust than the original samplesort for large core counts. We conduct an experimental study in which we compare hypercube sort, bitonic sort, the original samplesort, the staged samplesort, and HykSort. We report weak and strong scaling results and study the effect of the grain size. It turns out that no single algorithm performs best and a hybridization strategy is necessary. As a highlight of our study, in our largest experiment, on 262,144 AMD cores of the Cray XK7 "Titan" platform at the Oak Ridge National Laboratory, we sorted 8 trillion 32-bit integer keys in 37 seconds, achieving 0.9 TB/s effective throughput.
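The k-way recursion described in the abstract can be illustrated with a small single-process simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the ranks are modeled as a list of Python lists, the number of ranks is assumed to be a power of k, and the hypothetical helper `pick_splitters` gathers all keys to compute exact quantiles as a stand-in for the paper's iterative parallel select. The names `hyksort_sim` and `pick_splitters` are invented for this illustration.

```python
import bisect
import heapq
import random


def pick_splitters(blocks, k):
    """Hypothetical stand-in for the iterative parallel select.

    Gathers every key and returns k-1 exact quantiles. A distributed
    implementation would avoid this global gather.
    """
    all_keys = sorted(key for block in blocks for key in block)
    n = len(all_keys)
    return [all_keys[(i * n) // k] for i in range(1, k)]


def hyksort_sim(blocks, k):
    """Simulate a k-way hypercube quicksort.

    blocks[r] is the locally sorted data held by simulated rank r.
    Returns the per-rank blocks after sorting; their concatenation
    in rank order is globally sorted.
    """
    p = len(blocks)
    if p == 1 or not any(blocks):
        return blocks
    if p % k != 0:
        raise ValueError("this sketch assumes the rank count is a power of k")

    splitters = pick_splitters(blocks, k)
    group = p // k  # number of ranks in each of the k subgroups
    inbox = [[] for _ in range(p)]

    for r, block in enumerate(blocks):
        # Cut the sorted local block into k buckets at the splitters.
        cuts = [0] + [bisect.bisect_left(block, s) for s in splitters] + [len(block)]
        for i in range(k):
            # Bucket i goes to the rank with the same offset in subgroup i,
            # so each rank exchanges data with only k-1 partners.
            dest = i * group + r % group
            inbox[dest].append(block[cuts[i]:cuts[i + 1]])

    # Each rank merges the k sorted pieces it received.
    merged = [list(heapq.merge(*pieces)) for pieces in inbox]

    # Recurse independently inside each subgroup.
    result = []
    for i in range(k):
        result.extend(hyksort_sim(merged[i * group:(i + 1) * group], k))
    return result


if __name__ == "__main__":
    random.seed(0)
    blocks = [sorted(random.randrange(1000) for _ in range(50)) for _ in range(16)]
    out = hyksort_sim(blocks, k=4)
    flat = [x for b in out for x in b]
    print(flat == sorted(x for b in blocks for x in b))
```

With 16 simulated ranks and k = 4, the recursion finishes in two levels, compared with four levels for a binary (k = 2) recursion on the same rank count. In this sketch every rank exchanges data only with the k-1 partners that share its offset within the other subgroups, which illustrates how a k-way exchange can avoid a collective across all p ranks.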