Parallel Computing Experiences with CUDA

Authors:
Michael Garland; Scott Le Grand; John Nickolls; Joshua Anderson; Jim Hardwick; Scott Morton; Everett Phillips; Yao Zhang; Vasily Volkov
Affiliations:
NVIDIA; NVIDIA; NVIDIA; Iowa State University and Ames Laboratory; TechniScan Medical Systems; Hess; University of California, Davis; University of California, Davis; University of California, Berkeley
Venue:
IEEE Micro
Year:
2008

Citing 0
Cited 39

Controlling chaos: on safe side-effects in data-parallel operations

Proceedings of the 4th workshop on Declarative aspects of multicore programming
Experiences with Mapping Non-linear Memory Access Patterns into GPUs

ICCS '09 Proceedings of the 9th International Conference on Computational Science: Part I
Real-Time GPU-Based Voxel Carving with Systematic Occlusion Handling

Proceedings of the 31st DAGM Symposium on Pattern Recognition
Increasing memory miss tolerance for SIMD cores

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Algorithm/architecture co-exploration of visual computing on emergent platforms: overview and future prospects

IEEE Transactions on Circuits and Systems for Video Technology
Understanding throughput-oriented architectures

Communications of the ACM
High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster

Journal of Computational Physics
Orders-of-magnitude performance increases in GPU-accelerated correlation of images from the International Space Station

Journal of Real-Time Image Processing
Parallel processing with CUDA in ceramic tiles classification

KES'10 Proceedings of the 14th international conference on Knowledge-based and intelligent information and engineering systems: Part I
Optimizing memory access on GPUs using Morton order indexing

Proceedings of the 48th Annual Southeast Regional Conference
High performance predictable histogramming on GPUs: exploring and evaluating algorithm trade-offs

Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units
Parallelizing compiler framework and API for power reduction and software productivity of real-time heterogeneous multicores

LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Assessment of GPU computational enhancement to a 2D flood model

Environmental Modelling & Software
MPI-CUDA parallelization of a finite-strip program for geometric nonlinear analysis: A hybrid approach

Advances in Engineering Software
Operating systems must support GPU abstractions

HotOS'13 Proceedings of the 13th USENIX conference on Hot topics in operating systems
Granular representation of temporal signals using differential quadratures

ACIIDS'11 Proceedings of the Third international conference on Intelligent information and database systems - Volume Part II
Parallel multivariate slice sampling

Statistics and Computing
PTask: operating system abstractions to manage GPUs as compute devices

SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
GPU accelerated CAE using open solvers and the cloud

ACM SIGARCH Computer Architecture News
A GPU-based high-throughput image retrieval algorithm

Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units
Implementing P systems parallelism by means of GPUs

WMC'09 Proceedings of the 10th international conference on Membrane Computing
Towards user transparent parallel multimedia computing on GPU-Clusters

ISCA'10 Proceedings of the 2010 international conference on Computer Architecture
Parallelization of PageRank on multicore processors

ICDCIT'12 Proceedings of the 8th international conference on Distributed Computing and Internet Technology
GPU join processing revisited

DaMoN '12 Proceedings of the Eighth International Workshop on Data Management on New Hardware
On the correctness of the SIMT execution model of GPUs

ESOP'12 Proceedings of the 21st European conference on Programming Languages and Systems
Direct approaches to exploit many-core architecture in bioinformatics

Future Generation Computer Systems
Three-dimensional thinning algorithms on graphics processing units and multicore CPUs

Concurrency and Computation: Practice & Experience
Automatic generation of software pipelines for heterogeneous parallel systems

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Enhancing data parallelism for Ant Colony Optimization on GPUs

Journal of Parallel and Distributed Computing
Spill code placement for SIMD machines

SBLP'12 Proceedings of the 16th Brazilian conference on Programming Languages
Parallel partitioning for distributed systems using sequential assignment

Journal of Parallel and Distributed Computing
Data layout optimization for GPGPU architectures

Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Enhancing GPU parallelism in nature-inspired algorithms

The Journal of Supercomputing
Parallel multi-objective Ant Programming for classification using GPUs

Journal of Parallel and Distributed Computing
Real-time recovery of moving 3D faces for emerging applications

Computers in Industry
Parallel evaluation of Pittsburgh rule-based classifiers on GPUs

Neurocomputing
High level transforms for SIMD and low-level computer vision algorithms

Proceedings of the 2014 Workshop on Programming models for SIMD/Vector processing
Population-based harmony search using GPU applied to protein structure prediction

International Journal of Computational Science and Engineering
High performance evaluation of evolutionary-mined association rules on GPUs

The Journal of Supercomputing

Abstract

The CUDA programming model provides a straightforward means of describing inherently parallel computations, and NVIDIA's Tesla GPU architecture delivers high computational throughput on massively parallel problems. This article surveys experiences gained in applying CUDA to a diverse set of problems and reports the parallel speedups, relative to sequential codes running on traditional CPU architectures, attained by executing key computations on the GPU.