Scan primitives for GPU computing

Authors:
Shubhabrata Sengupta;Mark Harris;Yao Zhang;John D. Owens
Affiliations:
University of California, Davis;NVIDIA Corporation;University of California, Davis;University of California, Davis
Venue:
Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware
Year:
2007

Abstract

The scan primitives are powerful, general-purpose data-parallel primitives that are building blocks for a broad range of applications. We describe GPU implementations of these primitives, specifically an efficient formulation and implementation of segmented scan, on NVIDIA GPUs using the CUDA API. Using the scan primitives, we show novel GPU implementations of quicksort and sparse matrix-vector multiply, and analyze the performance of the scan primitives, several sort algorithms that use the scan primitives, and a graphical shallow-water fluid simulation using the scan framework for a tridiagonal matrix solver.
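
To make the terms in the abstract concrete, the following is a minimal sequential sketch of what scan and segmented scan compute, and of how a segmented sum can drive a sparse matrix-vector multiply. It is not the paper's CUDA implementation, which runs in parallel on the GPU. It only illustrates the semantics of the primitives. All function names, the head-flag convention (1 marks the first element of a segment), and the CSR layout used for the matrix are assumptions chosen for illustration.

```python
from itertools import accumulate
from operator import add


def inclusive_scan(values, op=add):
    """out[i] = values[0] op ... op values[i]."""
    return list(accumulate(values, op))


def exclusive_scan(values, op=add, identity=0):
    """out[i] = identity op values[0] op ... op values[i-1]."""
    out, running = [], identity
    for v in values:
        out.append(running)
        running = op(running, v)
    return out


def segmented_inclusive_scan(values, head_flags, op=add):
    """Inclusive scan that restarts wherever head_flags[i] == 1.

    Implemented as an ordinary scan over (flag, value) pairs with a
    combined operator. That operator is associative, which is the
    property that lets a parallel scan algorithm be reused unchanged.
    """
    def combine(left, right):
        lf, lv = left
        rf, rv = right
        # A head flag on the right operand discards the running total.
        return (lf | rf, rv if rf else op(lv, rv))

    pairs = accumulate(zip(head_flags, values), combine)
    return [v for _, v in pairs]


def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A @ x for a CSR matrix, using one segment per matrix row."""
    products = [v * x[c] for v, c in zip(vals, col_idx)]

    # Mark the first stored entry of every row as a segment head.
    flags = [0] * len(products)
    for start in row_ptr[:-1]:
        if start < len(flags):
            flags[start] = 1

    sums = segmented_inclusive_scan(products, flags)

    # The last element of each segment holds that row's dot product.
    # Rows with no stored entries contribute 0.
    n_rows = len(row_ptr) - 1
    return [
        sums[row_ptr[r + 1] - 1] if row_ptr[r + 1] > row_ptr[r] else 0
        for r in range(n_rows)
    ]
```

For example, `segmented_inclusive_scan([1, 2, 3, 4, 5, 6], [1, 0, 0, 1, 0, 1])` sums the three segments `[1, 2, 3]`, `[4, 5]`, and `[6]` independently. Expressing the segmented scan as a plain scan over pairs is one common formulation and is used here for brevity; the paper's own formulation may differ.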