Sorting is a basic building block of many applications, so sorting algorithms that run efficiently on modern machines are key to the performance of those applications. With the recent shift to using GPUs for general-purpose computing, researchers have proposed several sorting algorithms for single-GPU systems. However, some workstations and HPC systems have multiple GPUs, and applications running on them are designed to use all available GPUs in the system. In this paper we present a high-performance multi-GPU merge sort algorithm that solves the problem of sorting data distributed across several GPUs. Our merge sort algorithm first sorts the data on each GPU using an existing single-GPU sorting algorithm. Then, a series of merge steps produces a globally sorted array distributed across all the GPUs in the system. This merge phase is enabled by a novel pivot selection algorithm that ensures that merge steps always distribute data evenly among all GPUs. We also present the implementation of our sorting algorithm in CUDA, as well as a novel inter-GPU communication technique that enables this pivot selection algorithm. Experimental results show that an efficient implementation of our algorithm achieves a speedup of 1.9x when running on two GPUs and 3.3x when running on four GPUs, compared to sorting on a single GPU. At the same time, it is able to sort two and four times more records, respectively, than a single GPU can.
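The abstract does not spell out the pivot selection, so the following is only a minimal sketch of one way a balanced merge step between two devices could work, not the paper's CUDA implementation. It assumes two equally sized, already sorted arrays, with plain Python lists standing in for the per-GPU buffers. The names `select_pivot` and `merge_pair` are hypothetical. A binary search finds a pivot `p` such that exchanging the last `p` elements of the first array with the first `p` elements of the second leaves every element on the first device no larger than any element on the second, while each device still holds the same number of elements.

```python
import heapq


def select_pivot(a, b):
    """Return p such that swapping a[n-p:] with b[:p] partitions the data.

    Assumes a and b are sorted and have the same length n (an assumption
    of this sketch). Finds the smallest p in [0, n] for which
    a[n-p-1] <= b[p] holds, treating p == n as always valid.
    """
    n = len(a)
    assert len(b) == n, "this sketch assumes equally sized inputs"
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        # As mid grows, a[n-mid-1] shrinks and b[mid] grows, so the
        # predicate is monotone and binary search applies.
        if a[n - mid - 1] <= b[mid]:
            hi = mid
        else:
            lo = mid + 1
    return lo


def merge_pair(a, b):
    """One balanced merge step between two 'devices' (plain lists here)."""
    n = len(a)
    p = select_pivot(a, b)
    # Each side keeps n elements: (n - p) of its own plus p received ones.
    low = list(heapq.merge(a[:n - p], b[:p]))
    high = list(heapq.merge(a[n - p:], b[p:]))
    return low, high


if __name__ == "__main__":
    low, high = merge_pair([1, 4, 7, 9], [2, 3, 8, 10])
    print(low, high)
```

Because exactly `p` elements move in each direction, both sides keep their original size, which is the even-distribution property the abstract attributes to the pivot selection. In a real multi-GPU setting the two slices would be exchanged over the interconnect and each device would then merge its two sorted runs locally.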