Revisiting sorting for GPGPU stream architectures

Authors:
Duane G. Merrill;Andrew S. Grimshaw
Affiliations:
University of Virginia, Charlottesville, VA, USA;University of Virginia, Charlottesville, VA, USA
Venue:
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Year:
2010

Citing 5
Cited 9

Design patterns: elements of reusable object-oriented software

Introduction to Algorithms

Larrabee: a many-core x86 architecture for visual computing
ACM SIGGRAPH 2008 papers

Efficient implementation of sorting on multi-core SIMD CPU architecture
Proceedings of the VLDB Endowment

Designing efficient sorting algorithms for manycore GPUs
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing

CudaDMA: optimizing GPU memory bandwidth via warp specialization
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

Efficient probabilistic and geometric anatomical mapping using particle mesh approximation on GPUs
Journal of Biomedical Imaging - Special issue on Parallel Computation in Medical Imaging Applications

Speeding up large-scale geospatial polygon rasterization on GPGPUs
Proceedings of the ACM SIGSPATIAL Second International Workshop on High Performance and Distributed Geographic Information Systems

Fast GPU-based locality sensitive hashing for k-nearest neighbor computation
Proceedings of the 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems

Design and implementation of an efficient integer count sort in CUDA GPUs
Concurrency and Computation: Practice & Experience

Fast GPU perspective grid construction and triangle tracing for exhaustive ray tracing of highly coherent rays
International Journal of High Performance Computing Applications

Maximizing parallelism in the construction of BVHs, octrees, and k-d trees
EGGH-HPG'12 Proceedings of the Fourth ACM SIGGRAPH / Eurographics conference on High-Performance Graphics

OpenCL C++
Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units

Triggered instructions: a control paradigm for spatially-programmed architectures
Proceedings of the 40th Annual International Symposium on Computer Architecture

Abstract

This poster presents efficient strategies for sorting large sequences of fixed-length keys (and values) using GPGPU stream processors. Compared to the state of the art, our radix sorting methods exhibit speedups of at least 2x on all generations of NVIDIA GPGPUs, and up to 3.7x on current GT200-based models. Our implementations demonstrate sorting rates of 482 million key-value pairs per second and 550 million keys per second (32-bit). For this domain of sorting problems, we believe our sorting primitive to be the fastest available for any fully programmable microarchitecture. These results motivate a different breed of parallel primitives for GPGPU stream architectures, one that can better exploit memory and computational resources while maintaining the flexibility of a reusable component. Our sorting performance derives from a parallel scan stream primitive that has been generalized in two ways: (1) with local interfaces for producer/consumer operations (visiting logic), and (2) with interfaces for performing multiple related, concurrent prefix scans (multi-scan).
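
The relationship between radix sorting and multiple concurrent prefix scans can be illustrated with a small sequential sketch. It is not the authors' GPU implementation: it runs on one thread in Python, and the names `multi_scan_pass` and `radix_sort_pairs`, the 4-bit digit width, and the assumption of unsigned 32-bit keys are all chosen here for illustration only. Each pass keeps one running count per digit value (a set of related scans advanced together) and then scans the digit totals to obtain the scatter offsets.

```python
def multi_scan_pass(keys, values, shift, radix_bits=4):
    """One stable radix pass on the digit at bit offset `shift` (hypothetical helper)."""
    radix = 1 << radix_bits
    mask = radix - 1
    digits = [(k >> shift) & mask for k in keys]

    # One exclusive prefix count per digit value, advanced together:
    # local_rank[i] is the number of earlier keys that share key i's digit.
    running = [0] * radix
    local_rank = []
    for d in digits:
        local_rank.append(running[d])
        running[d] += 1

    # Exclusive scan over the per-digit totals gives each bucket's base offset.
    base = [0] * radix
    total = 0
    for d in range(radix):
        base[d] = total
        total += running[d]

    # Scatter keys and values to their destinations; equal digits keep their order.
    out_keys = [0] * len(keys)
    out_values = [None] * len(values)
    for i, d in enumerate(digits):
        dest = base[d] + local_rank[i]
        out_keys[dest] = keys[i]
        out_values[dest] = values[i]
    return out_keys, out_values


def radix_sort_pairs(keys, values, key_bits=32, radix_bits=4):
    """Least-significant-digit radix sort of key-value pairs (unsigned keys assumed)."""
    keys, values = list(keys), list(values)
    for shift in range(0, key_bits, radix_bits):
        keys, values = multi_scan_pass(keys, values, shift, radix_bits)
    return keys, values


if __name__ == "__main__":
    print(radix_sort_pairs([170, 45, 75, 90, 802, 24, 2, 66], list("abcdefgh")))
```

On a GPU the per-digit counts would be produced by parallel scans rather than a sequential loop; the sketch only shows which quantities those scans compute.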