Scalable Parallel Programming with CUDA

Authors: John Nickolls (NVIDIA); Ian Buck (NVIDIA); Michael Garland (NVIDIA); Kevin Skadron (University of Virginia)
Venue: Queue - GPU Computing
Year: 2008

Abstract

The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. Furthermore, their parallelism continues to scale with Moore's law. The challenge is to develop mainstream application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism to manycore GPUs with widely varying numbers of cores.
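
The idea of software that scales transparently with core count can be illustrated with a minimal sketch. It is not taken from the paper: the names `scale_block` and `parallel_scale`, the block size, and the use of Python's thread pool are all assumptions made for illustration. The work is split into independent blocks whose number depends only on the problem size, and the number of workers is discovered at run time, so the same program produces the same result on any number of cores. CPython threads will not speed up this CPU-bound loop; the sketch shows only the structure of the decomposition.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def scale_block(data, a, start, stop):
    """Process one independent block: multiply data[start:stop] by a."""
    return [a * x for x in data[start:stop]]


def parallel_scale(data, a, block_size=1024, workers=None):
    """Multiply every element of data by a, one independent block at a time.

    The block decomposition depends only on len(data) and block_size.
    The worker count (hypothetical parameter, defaulting to the number of
    cores reported by the OS) decides how many blocks run at once, never
    what the result is.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    workers = workers or os.cpu_count() or 1
    starts = range(0, len(data), block_size)

    def run(start):
        # Each block touches only its own slice, so blocks may run in any
        # order and on any worker.
        return scale_block(data, a, start, min(start + block_size, len(data)))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields block results in submission order.
        return [y for block in pool.map(run, starts) for y in block]
```

Calling `parallel_scale(values, 2.0)` with no `workers` argument uses however many cores the machine reports; passing `workers=1` runs the same blocks sequentially and returns the same list.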