Fast in-place, comparison-based sorting with CUDA: a study with bitonic sort

  • Authors:
  • Hagen Peters;Ole Schulz-Hildebrandt;Norbert Luttenberger

  • Affiliations:
  • Research Group for Communication Systems, Department of Computer Science, Christian-Albrechts-University Kiel, Germany;Research Group for Communication Systems, Department of Computer Science, Christian-Albrechts-University Kiel, Germany;Research Group for Communication Systems, Department of Computer Science, Christian-Albrechts-University Kiel, Germany

  • Venue:
  • Concurrency and Computation: Practice & Experience
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

State-of-the-art graphics processors provide high processing power and furthermore, the high programmability of GPUs offered by frameworks like CUDA (Compute Unified Device Architecture) increases their usability as high-performance co-processors for general-purpose computing. Sorting is well investigated in Computer Science in general, but (because of this new field of application for GPUs) there is a demand for high-performance parallel sorting algorithms that fit with the characteristics of the modern GPU-architecture. We present a high-performance in-place implementation of Batcher's bitonic sorting networks for CUDA-enabled GPUs. Therefore, we assigned compare/exchange operations to threads in a way that decreases low-performance global-memory access and makes efficient use of high-performance shared memory. This greatly increases the performance of this in-place, comparison-based sorting algorithm. Our implementation outperforms all other algorithms in our tests when sorting 64-bit keys. It is the fastest comparison-based GPU sorting algorithm for 32-bit keys, being only outperformed by (non-comparison-based) radix sort when sorting sequences larger than 223. Copyright © 2011 John Wiley & Sons, Ltd.