Efficient parallel algorithms
SIAM Journal on Computing
Adaptive bitonic sorting: an optimal parallel algorithm for shared-memory machines
SIAM Journal on Computing
Logarithmic time cost optimal parallel sorting is not yet fast in practice!
Proceedings of the 1990 ACM/IEEE conference on Supercomputing
Efficient conditional operations for data-parallel architectures
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Parallel Sorting Algorithms
STOC '83 Proceedings of the fifteenth annual ACM symposium on Theory of computing
Photon mapping on programmable graphics hardware
Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware
Brook for GPUs: stream computing on graphics hardware
ACM SIGGRAPH 2004 Papers
Multicores from the Compiler's Perspective: A Blessing or a Curse?
Proceedings of the international symposium on Code generation and optimization
UberFlow: a GPU-based particle engine
Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware
Fast and approximate stream mining of quantiles and frequencies using graphics processors
Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Streaming architectures and technology trends
SIGGRAPH '05 ACM SIGGRAPH 2005 Courses
Sorting networks and their applications
AFIPS '68 (Spring) Proceedings of the April 30--May 2, 1968, spring joint computer conference
CellSort: high performance sorting on the cell processor
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
BSGP: bulk-synchronous GPU programming
ACM SIGGRAPH 2008 papers
Fast parallel GPU-sorting using a hybrid algorithm
Journal of Parallel and Distributed Computing
A Practical Quicksort Algorithm for Graphics Processors
ESA '08 Proceedings of the 16th annual European symposium on Algorithms
Efficient implementation of sorting on multi-core SIMD CPU architecture
Proceedings of the VLDB Endowment
Optimizing the parallel computation of linear recurrences using compact matrix representations
Journal of Parallel and Distributed Computing
Data parallel acceleration of decision support queries using Cell/BE and GPUs
Proceedings of the 6th ACM conference on Computing frontiers
GPU-Quicksort: A practical Quicksort algorithm for graphics processors
Journal of Experimental Algorithmics (JEA)
A Fast and Flexible Sorting Algorithm with CUDA
ICA3PP '09 Proceedings of the 9th International Conference on Algorithms and Architectures for Parallel Processing
Compiler support for general-purpose computation on GPUs
The Journal of Supercomputing
Sort vs. Hash revisited: fast join implementation on modern multi-core CPUs
Proceedings of the VLDB Endowment
GPU-based island model for evolutionary algorithms
Proceedings of the 12th annual conference on Genetic and evolutionary computation
Fast in-place sorting with CUDA based on bitonic sort
PPAM'09 Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I
GPU-WAH: applying GPUs to compressing bitmap indexes with word aligned hybrid
DEXA'10 Proceedings of the 21st international conference on Database and expert systems applications: Part II
Design and implementation of an efficient integer count sort in CUDA GPUs
Concurrency and Computation: Practice & Experience
Multiresolution MIP rendering of large volumetric data accelerated on graphics hardware
EUROVIS'07 Proceedings of the 9th Joint Eurographics / IEEE VGTC conference on Visualization
Hi-index | 0.00 |
In this paper, we present a novel approach for parallel sorting on stream processing architectures. It is based on adaptive bitonic sorting. For sorting n values utilizing p stream processor units, this approach achieves the optimal time complexity O((n log n)/p). While this makes our approach competitive with common sequential sorting algorithms not only from a theoretical viewpoint, it is also very fast from a practical viewpoint. This is achieved by using efficient linear stream memory accesses (and by combining the optimal time approach with algorithms optimized for small input sequences). We present an implementation on modern programmable graphics hardware (GPUs). On recent GPUs, our optimal parallel sorting approach has shown to be remarkably faster than sequential sorting on the CPU, and it is also faster than previous nonoptimal sorting approaches on the GPU for sufficiently large input sequences. Because of the excellent scalability of our algorithm with the number of stream processor units p (up to n/ log2 n or even n/ log n units, depending on the stream architecture), our approach profits heavily from the trend of increasing number of fragment processor units on GPUs, so that we can expect further speed improvement with upcoming GPU generations.