Fast in-place, comparison-based sorting with CUDA: a study with bitonic sort

Authors:
Hagen Peters;Ole Schulz-Hildebrandt;Norbert Luttenberger
Affiliations:
Research Group for Communication Systems, Department of Computer Science, Christian-Albrechts-University Kiel, Germany (all authors)
Venue:
Concurrency and Computation: Practice & Experience
Year:
2011

Abstract

State-of-the-art graphics processors provide high processing power, and the programmability offered by frameworks like CUDA (Compute Unified Device Architecture) increases their usability as high-performance co-processors for general-purpose computing. Sorting is well investigated in computer science in general, but this new field of application for GPUs creates a demand for high-performance parallel sorting algorithms that fit the characteristics of modern GPU architectures. We present a high-performance in-place implementation of Batcher's bitonic sorting networks for CUDA-enabled GPUs. To this end, we assign compare/exchange operations to threads in a way that reduces low-performance global-memory accesses and makes efficient use of high-performance shared memory. This greatly increases the performance of this in-place, comparison-based sorting algorithm. Our implementation outperforms all other algorithms in our tests when sorting 64-bit keys. It is the fastest comparison-based GPU sorting algorithm for 32-bit keys, being outperformed only by (non-comparison-based) radix sort when sorting sequences larger than 2^23. Copyright © 2011 John Wiley & Sons, Ltd.
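
To make the compare/exchange structure of a bitonic sorting network concrete, the following is a minimal sequential C++ sketch. It is not the authors' CUDA implementation: the function names `compare_exchange_step` and `bitonic_sort_in_place` are hypothetical, the input length is assumed to be a power of two, and the shared-memory optimization described in the abstract is omitted. Each iteration of the inner loop over `i` corresponds to the work that one GPU thread could perform independently within a step.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// One step of the network (hypothetical helper name).
// Element i is paired with element i ^ j. The bit (i & k) selects whether
// the pair is ordered ascending or descending. All pairs in a step are
// disjoint, so on a GPU each pair could be handled by a separate thread.
void compare_exchange_step(std::vector<std::uint64_t>& keys,
                           std::size_t k, std::size_t j) {
    const std::size_t n = keys.size();
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t partner = i ^ j;
        if (partner > i) {  // visit each pair exactly once
            const bool ascending = (i & k) == 0;
            const bool out_of_order = ascending ? keys[i] > keys[partner]
                                                : keys[i] < keys[partner];
            if (out_of_order) {
                std::swap(keys[i], keys[partner]);
            }
        }
    }
}

// In-place bitonic sort (hypothetical name). Assumes keys.size() is a
// power of two; no auxiliary buffer is allocated.
void bitonic_sort_in_place(std::vector<std::uint64_t>& keys) {
    const std::size_t n = keys.size();
    assert((n & (n - 1)) == 0 && "length must be a power of two");
    // k is the size of the bitonic sequences being merged in this phase.
    for (std::size_t k = 2; k <= n; k <<= 1) {
        // j is the distance between compared elements in this step.
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {
            compare_exchange_step(keys, k, j);
        }
    }
}
```

Calling `bitonic_sort_in_place` on a vector of 64-bit keys whose length is a power of two sorts it in ascending order. In a GPU setting, steps with a small distance `j` touch only nearby elements, which is what makes it plausible to run several consecutive steps on a tile held in shared memory before writing back to global memory; the sketch above does not attempt this.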