Parallel Shellsort Algorithm for Many-Core GPUs with CUDA

Authors:
Chun-Yuan Lin;Wei Sheng Lee;Chuan Yi Tang
Affiliations:
Chang Gung University, Taiwan;National Tsing Hua University, Taiwan;Providence University, Taiwan
Venue:
International Journal of Grid and High Performance Computing
Year:
2012

Citing 13
Cited 0

Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator

ACM Transactions on Modeling and Computer Simulation (TOMACS) - Special issue on uniform random number generation
A randomized parallel sorting algorithm with an experimental study

Journal of Parallel and Distributed Computing
Analysis of Shellsort and Related Algorithms

ESA '96 Proceedings of the Fourth Annual European Symposium on Algorithms
Best Increments for the Average Case of Shellsort

FCT '01 Proceedings of the 13th International Symposium on Fundamentals of Computation Theory
GPUTeraSort: high performance graphics co-processor sorting for large database management

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Scan primitives for GPU computing

Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware
Scalable Parallel Programming with CUDA

Queue - GPU Computing
Fast parallel GPU-sorting using a hybrid algorithm

Journal of Parallel and Distributed Computing
Benchmarking GPUs to tune dense linear algebra

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
A Practical Quicksort Algorithm for Graphics Processors

ESA '08 Proceedings of the 16th annual European symposium on Algorithms
Designing efficient sorting algorithms for manycore GPUs

IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
A Fast and Flexible Sorting Algorithm with CUDA

ICA3PP '09 Proceedings of the 9th International Conference on Algorithms and Architectures for Parallel Processing
Fast in-place sorting with CUDA based on bitonic sort

PPAM'09 Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I

Quantified Score

Hi-index	0.00

Visualization

Abstract

Sorting is a classic algorithmic problem and its importance has led to the design and implementation of various sorting algorithms on many-core graphics processing units GPUs. CUDPP Radix sort is the most efficient sorting on GPUs and GPU Sample sort is the best comparison-based sorting. Although the implementations of these algorithms are efficient, they either need an extra space for the data rearrangement or the atomic operation for the acceleration. Sorting applications usually deal with a large amount of data, thus the memory utilization is an important consideration. Furthermore, these sorting algorithms on GPUs without the atomic operation support can result in the performance degradation or fail to work. In this paper, an efficient implementation of a parallel shellsort algorithm, CUDA shellsort, is proposed for many-core GPUs with CUDA. Experimental results show that, on average, the performance of CUDA shellsort is nearly twice faster than GPU quicksort and 37% faster than Thrust mergesort under uniform distribution. Moreover, its performance is the same as GPU sample sort up to 32 million data elements, but only needs a constant space usage. CUDA shellsort is also robust over various data distributions and could be suitable for other many-core architectures.