Designing efficient sorting algorithms for manycore GPUs

Authors:
Nadathur Satish; Mark Harris; Michael Garland
Affiliations:
Dept. of Electrical Engineering and Computer Sciences, University of California, Berkeley, USA; NVIDIA Corporation, Santa Clara, CA, USA; NVIDIA Corporation, Santa Clara, CA, USA
Venue:
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing
Year:
2009

Citing 0
Cited 74

Stream compaction for deferred shading

Proceedings of the Conference on High Performance Graphics 2009
Real-time parallel hashing on the GPU

ACM SIGGRAPH Asia 2009 papers
Interactive fluid-particle simulation using translating Eulerian grids

Proceedings of the 2010 ACM SIGGRAPH symposium on Interactive 3D Graphics and Games
FreePipe: a programmable parallel rendering architecture for efficient multi-fragment effects

Proceedings of the 2010 ACM SIGGRAPH symposium on Interactive 3D Graphics and Games
Teaching design & analysis of multi-core parallel algorithms using CUDA

Journal of Computing Sciences in Colleges
The Scalable Heterogeneous Computing (SHOC) benchmark suite

Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU

Proceedings of the 37th annual international symposium on Computer architecture
High-detailed fluid simulations on the GPU

ACM SIGGRAPH 2010 Talks
Understanding throughput-oriented architectures

Communications of the ACM
Revisiting sorting for GPGPU stream architectures

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Fast parallel surface and solid voxelization on GPUs

ACM SIGGRAPH Asia 2010 papers
Fast in-place sorting with CUDA based on bitonic sort

PPAM'09 Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I
A two-level real-time vision machine combining coarse- and fine-grained parallelism

Journal of Real-Time Image Processing
HLBVH: hierarchical LBVH construction for real-time ray tracing of dynamic geometry

Proceedings of the Conference on High Performance Graphics
Particle-in-cell simulations with charge-conserving current deposition on graphic processing units

Journal of Computational Physics
OpenCL and parallel primitives for digital TV applications

IBM Journal of Research and Development
GPU curvature estimation on deformable meshes

I3D '11 Symposium on Interactive 3D Graphics and Games
FPGA vs. multi-core CPUs vs. GPUs: hands-on experience with a sorting application

Facing the multicore-challenge
Dynamic workload balancing deques for branch and bound algorithms in the message passing interface

International Journal of High Performance Systems Architecture
Implicit and dynamic trees for high performance rendering

Proceedings of Graphics Interface 2011
SAH KD-tree construction on GPU

Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics
Randomized selection on the GPU

Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics
Preliminary work on graphics processing unit based direct simulation Monte Carlo

Proceedings of the 2010 Conference on Grand Challenges in Modeling & Simulation
A GPU-Based Implementation for Range Queries on Spaghettis Data Structure

ICCSA'11 Proceedings of the 2011 international conference on Computational science and its applications - Volume Part I
Analysis of Multi-Sort Algorithm on Multi-Mesh of Trees (MMT) architecture

The Journal of Supercomputing
Parallelized agent-based simulation on CPU and graphics hardware for spatial and stochastic models in biology

Proceedings of the 9th International Conference on Computational Methods in Systems Biology
Real-time computation of advanced rules in OLAP databases

ADBIS'11 Proceedings of the 15th international conference on Advances in databases and information systems
Designing fast architecture-sensitive tree search on modern multicore/many-core processors

ACM Transactions on Database Systems (TODS)
Accelerating the Smoldyn spatial stochastic biochemical reaction network simulator using GPUs

Proceedings of the 19th High Performance Computing Symposia
Speeding up large-scale geospatial polygon rasterization on GPGPUs

Proceedings of the ACM SIGSPATIAL Second International Workshop on High Performance and Distributed Geographic Information Systems
Parallel implementation of external sort and join operations on a multi-core network-optimized system on a chip

ICA3PP'11 Proceedings of the 11th international conference on Algorithms and architectures for parallel processing - Volume Part I
Parallel quadtree coding of large-scale raster geospatial data on GPGPUs

Proceedings of the 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems
Scalable parallel minimum spanning forest computation

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Implementing P systems parallelism by means of GPUs

WMC'09 Proceedings of the 10th international conference on Membrane Computing
Fast GPU-Based fluid simulations using SPH

PARA'10 Proceedings of the 10th international conference on Applied Parallel and Scientific Computing - Volume 2
Improving the speed and stability of the k-nearest neighbors method

Pattern Recognition Letters
Design and implementation of an efficient integer count sort in CUDA GPUs

Concurrency and Computation: Practice & Experience
A high-performance sorting algorithm for multicore single-instruction multiple-data processors

Software—Practice & Experience
Graphics processing unit based direct simulation Monte Carlo

Simulation
Thermal management of a many-core processor under fine-grained parallelism

Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing
VAST-Tree: a vector-advanced and compressed structure for massive data tree traversal

Proceedings of the 15th International Conference on Extending Database Technology
GPU merge path: a GPU merging algorithm

Proceedings of the 26th ACM international conference on Supercomputing
GPU Performance Enhancement via Communication Cost Reduction: Case Studies of Radix Sort and WSN Relay Node Placement Problem

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Fast GPU perspective grid construction and triangle tracing for exhaustive ray tracing of highly coherent rays

International Journal of High Performance Computing Applications
Discrete range searching primitive for the GPU and its applications

Journal of Experimental Algorithmics (JEA)
GPGPU implementation of growing neural gas: Application to 3D scene reconstruction

Journal of Parallel and Distributed Computing
Interactive global photon mapping

EGSR'09 Proceedings of the Twentieth Eurographics conference on Rendering
Ray tracing dynamic scenes with shadows on GPU

EG PGV'10 Proceedings of the 10th Eurographics conference on Parallel Graphics and Visualization
Approximate parallel sorting on a spatial computer

Proceedings of the 2012 ACM workshop on Relaxing synchronization for multicore and manycore scalability
Parallel suffix array construction for shared memory architectures

SPIRE'12 Proceedings of the 19th international conference on String Processing and Information Retrieval
RDFS reasoning on massively parallel hardware

ISWC'12 Proceedings of the 11th international conference on The Semantic Web - Volume Part I
Parallel Shellsort Algorithm for Many-Core GPUs with CUDA

International Journal of Grid and High Performance Computing
Parallel suffix array and least common prefix for the GPU

Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
StreamScan: fast scan algorithms for GPUs without global barrier synchronization

Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Comparison based sorting for systems with multiple GPUs

Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
Designing a database system for modern processing architectures

Proceedings of the 2013 SIGMOD/PODS Ph.D. Symposium
Optimising purely functional GPU programs

Proceedings of the 18th ACM SIGPLAN international conference on Functional programming
Box-counting algorithm on GPU and multi-core CPU: an OpenCL cross-platform study

The Journal of Supercomputing
Evaluating the acceleration of typical scientific problems on the GPU

Proceedings of the South African Institute for Computer Scientists and Information Technologists Conference
A micro 64-tree structure for accelerating ray tracing on a GPU

Proceedings of Graphics Interface 2013
Towards accelerating smoothed particle hydrodynamics simulations for free-surface flows on multi-GPU clusters

Journal of Parallel and Distributed Computing
Interactive smoke simulation and rendering on the GPU

Proceedings of the 12th ACM SIGGRAPH International Conference on Virtual-Reality Continuum and Its Applications in Industry
Register level sort algorithm on multi-core SIMD processors

IA^3 '13 Proceedings of the 3rd Workshop on Irregular Applications: Architectures and Algorithms
A sound and complete abstraction for reasoning about parallel prefix sums

Proceedings of the 41st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages
Hardware-oblivious parallelism for in-memory column-stores

Proceedings of the VLDB Endowment
Red Fox: An Execution Environment for Relational Query Processing on GPUs

Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
Hardware acceleration of database operations

Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays
Theoretical analysis of classic algorithms on highly-threaded many-core GPUs

Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
Time- and space-efficient flow-sensitive points-to analysis

ACM Transactions on Architecture and Code Optimization (TACO)
A memory access model for highly-threaded many-core architectures

Future Generation Computer Systems
GPU-based parallel construction of compact visual hull meshes

The Visual Computer: International Journal of Computer Graphics
Sorted deferred shading for production path tracing

EGSR '13 Proceedings of the Eurographics Symposium on Rendering

Quantified Score

Hi-index	0.02

Abstract

We describe the design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by CUDA. Our radix sort is the fastest GPU sort and our merge sort is the fastest comparison-based sort reported in the literature. Our radix sort is up to 4 times faster than the graphics-based GPUSort and more than 2 times faster than other CUDA-based radix sorts. It is also 23% faster, on average, than even a very carefully optimized multicore CPU sorting routine. To achieve this performance, we carefully design our algorithms to expose substantial fine-grained parallelism and decompose the computation into independent tasks that perform minimal global communication. We exploit the high-speed on-chip shared memory provided by NVIDIA's GPU architecture and efficient data-parallel primitives, particularly parallel scan. While targeted at GPUs, these algorithms should also be well-suited for other manycore processors.
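
The abstract's connection between radix sort and parallel scan can be illustrated with a small sequential model. The sketch below is not the authors' CUDA implementation. It only shows how each radix pass reduces to a digit histogram, an exclusive prefix sum (scan) over that histogram, and a stable scatter. The function names, the 4-bit digit width, and the 32-bit key width are illustrative assumptions.

```python
from itertools import accumulate


def exclusive_scan(values):
    """Exclusive prefix sum: out[i] is the sum of values[0..i-1]."""
    return [0] + list(accumulate(values))[:-1]


def radix_sort(keys, bits_per_pass=4, key_bits=32):
    """LSD radix sort for non-negative integers below 2**key_bits.

    Sequential model only: on a GPU the histogram, scan, and scatter
    steps would each run in parallel. Parameter defaults are
    illustrative assumptions, not values taken from the paper.
    """
    radix = 1 << bits_per_pass
    mask = radix - 1
    keys = list(keys)
    for shift in range(0, key_bits, bits_per_pass):
        # Extract the current digit of every key.
        digits = [(k >> shift) & mask for k in keys]
        # Histogram: count how many keys fall in each digit bucket.
        counts = [0] * radix
        for d in digits:
            counts[d] += 1
        # The scan turns bucket counts into bucket start offsets.
        offsets = exclusive_scan(counts)
        # Stable scatter: keys keep their relative order within a bucket.
        out = [0] * len(keys)
        for k, d in zip(keys, digits):
            out[offsets[d]] = k
            offsets[d] += 1
        keys = out
    return keys
```

Calling `radix_sort([170, 45, 75, 90, 802, 24, 2, 66])` returns the keys in ascending order. The scan is the only step that needs information from all buckets, which is why an efficient parallel scan matters for this kind of sort.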