GPU-ABiSort: optimal parallel sorting on stream architectures

Authors:
Alexander Greß; Gabriel Zachmann
Affiliations:
Institute of Computer Science II, Rhein. Friedr.-Wilh.-Universität Bonn, Bonn, Germany; Institute of Computer Science, Clausthal University of Technology, Clausthal, Germany
Venue:
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Year:
2006

Abstract

In this paper, we present a novel approach for parallel sorting on stream processing architectures, based on adaptive bitonic sorting. For sorting n values using p stream processor units, this approach achieves the optimal time complexity O((n log n)/p). This makes our approach competitive with common sequential sorting algorithms from a theoretical viewpoint, and it is also very fast in practice. The practical speed is achieved by using efficient linear stream memory accesses and by combining the optimal-time approach with algorithms optimized for small input sequences. We present an implementation on modern programmable graphics hardware (GPUs). On recent GPUs, our optimal parallel sorting approach has been shown to be remarkably faster than sequential sorting on the CPU, and it is also faster than previous non-optimal sorting approaches on the GPU for sufficiently large input sequences. Because our algorithm scales excellently with the number of stream processor units p (up to n / log² n or even n / log n units, depending on the stream architecture), our approach profits heavily from the trend toward more fragment processor units on GPUs, so we can expect further speed improvements with upcoming GPU generations.
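
For orientation, the following is a minimal sequential sketch of the classic bitonic sorting network that adaptive bitonic sorting builds on. It is not the GPU-ABiSort algorithm from the paper: the classic network performs O(n log² n) comparisons, whereas the adaptive variant reduces this to O(n log n). The function names are hypothetical, the input length is assumed to be a power of two, and `max_useful_units` assumes base-2 logarithms, which the abstract does not specify.

```python
import math


def bitonic_merge(values, ascending=True):
    """Sort a bitonic sequence whose length is a power of two."""
    n = len(values)
    if n <= 1:
        return list(values)
    half = n // 2
    lo, hi = list(values[:half]), list(values[half:])
    for i in range(half):
        # Compare-exchange elements that are `half` positions apart.
        # These comparisons are independent, so a parallel machine
        # could perform all of them in one step.
        if (lo[i] > hi[i]) == ascending:
            lo[i], hi[i] = hi[i], lo[i]
    # Both halves are bitonic again and can be merged independently.
    return bitonic_merge(lo, ascending) + bitonic_merge(hi, ascending)


def bitonic_sort(values, ascending=True):
    """Classic (non-adaptive) bitonic sort; illustrative only."""
    n = len(values)
    if n <= 1:
        return list(values)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    half = n // 2
    # An ascending run followed by a descending run is bitonic.
    first = bitonic_sort(values[:half], True)
    second = bitonic_sort(values[half:], False)
    return bitonic_merge(first + second, ascending)


def max_useful_units(n):
    """Evaluate the scalability bounds n / log^2 n and n / log n
    quoted in the abstract, assuming base-2 logarithms."""
    log_n = math.log2(n)
    return n / log_n ** 2, n / log_n
```

For example, `bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5])` returns the values in ascending order. Under the base-2 assumption, `max_useful_units(2 ** 20)` gives roughly 2,621 and 52,429 units for the two bounds.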