This paper presents an algorithm for fast sorting of large lists on modern GPUs. The method achieves high speed by exploiting the parallelism of the GPU throughout the whole algorithm. First, a GPU-based bucketsort or quicksort splits the list into enough sublists, which are then sorted in parallel using merge-sort. The algorithm has complexity n log n. For lists of 8M elements on a single GeForce 8800 GTS-512, it is 2.5 times as fast as the bitonic sort algorithms, which have a standard complexity of n (log n)^2 and were long considered the fastest for GPU sorting. It is also 6 times faster than a single-CPU quicksort and 10% faster than the recent GPU-based radix sort. Finally, the algorithm is further parallelized to use two graphics cards, yielding an additional 1.8 times speedup.
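The two-phase structure described above (split the list into sublists, then merge-sort each sublist) can be illustrated with a small sequential sketch. This is not the paper's GPU implementation: it runs on the CPU, and the function names (`choose_pivots`, `split_into_buckets`, `hybrid_sort`), the sampling-based pivot choice, and the default bucket count are all assumptions made for illustration. On a GPU, the per-bucket sorts would run in parallel rather than in a loop.

```python
import bisect
import random


def merge(left, right):
    """Merge two sorted lists into one sorted list."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def merge_sort(items):
    """Plain top-down merge sort; stands in for the per-sublist sort."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    return merge(merge_sort(items[:mid]), merge_sort(items[mid:]))


def choose_pivots(items, num_buckets, sample_size=64, seed=0):
    """Pick num_buckets - 1 pivots from a sorted random sample.

    Assumed heuristic for this sketch, not taken from the paper.
    """
    rng = random.Random(seed)
    sample = sorted(rng.sample(items, min(sample_size, len(items))))
    step = len(sample) / num_buckets
    return [sample[int(k * step)] for k in range(1, num_buckets)]


def split_into_buckets(items, pivots):
    """Bucket k receives every x with pivots[k-1] <= x < pivots[k]."""
    buckets = [[] for _ in range(len(pivots) + 1)]
    for x in items:
        buckets[bisect.bisect_right(pivots, x)].append(x)
    return buckets


def hybrid_sort(items, num_buckets=4):
    """Split into value-ordered buckets, then merge-sort each bucket."""
    items = list(items)
    if not items:
        return []
    pivots = choose_pivots(items, num_buckets)
    result = []
    # Each bucket is independent, so these sorts could run in parallel.
    for bucket in split_into_buckets(items, pivots):
        result.extend(merge_sort(bucket))
    return result
```

Because every element of one bucket is no larger than every element of the next, concatenating the sorted buckets gives a fully sorted list without a final merge across buckets. For example, `hybrid_sort([5, 3, 9, 1, 5, 7, 2, 8])` returns the input in ascending order.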