Fast box-counting algorithm on GPU
Computer Methods and Programs in Biomedicine
Computers and Electrical Engineering
Bitonic sort on a chained-cubic tree interconnection network
Journal of Parallel and Distributed Computing
Hi-index | 0.00 |
State-of-the-art graphics processors provide high processing power and furthermore, the high programmability of GPUs offered by frameworks like CUDA (Compute Unified Device Architecture) increases their usability as high-performance co-processors for general-purpose computing. Sorting is well investigated in Computer Science in general, but (because of this new field of application for GPUs) there is a demand for high-performance parallel sorting algorithms that fit with the characteristics of the modern GPU-architecture. We present a high-performance in-place implementation of Batcher's bitonic sorting networks for CUDA-enabled GPUs. Therefore, we assigned compare/exchange operations to threads in a way that decreases low-performance global-memory access and makes efficient use of high-performance shared memory. This greatly increases the performance of this in-place, comparison-based sorting algorithm. Our implementation outperforms all other algorithms in our tests when sorting 64-bit keys. It is the fastest comparison-based GPU sorting algorithm for 32-bit keys, being only outperformed by (non-comparison-based) radix sort when sorting sequences larger than 223. Copyright © 2011 John Wiley & Sons, Ltd.