Optimization principles and application performance evaluation of a multithreaded GPU using CUDA

Authors:
Shane Ryoo; Christopher I. Rodrigues; Sara S. Baghsorkhi; Sam S. Stone; David B. Kirk; Wen-mei W. Hwu
Affiliations:
University of Illinois at Urbana-Champaign, Urbana, IL, USA; University of Illinois at Urbana-Champaign, Urbana, IL, USA; University of Illinois at Urbana-Champaign, Urbana, IL, USA; University of Illinois at Urbana-Champaign, Urbana, IL, USA; NVIDIA Corporation, Santa Clara, CA, USA; University of Illinois at Urbana-Champaign, Urbana, IL, USA
Venue:
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Year:
2008

Abstract

GPUs have recently attracted the attention of many application developers as commodity data-parallel coprocessors. The newest generations of GPU architecture provide easier programmability and increased generality while maintaining the tremendous memory bandwidth and computational power of traditional GPUs. This opportunity should redirect efforts in GPGPU research from ad hoc porting of applications to establishing principles and strategies that allow efficient mapping of computation to graphics hardware. In this work, we discuss the GeForce 8800 GTX processor's organization, features, and generalized optimization strategies. The key to performance on this platform is using massive multithreading to utilize the large number of cores and hide global memory latency. To achieve this, developers must strike the right balance between each thread's resource usage and the number of simultaneously active threads. The resources to manage include the number of registers and the amount of on-chip memory used per thread, the number of threads per multiprocessor, and global memory bandwidth. We also obtain increased performance by reordering accesses to off-chip memory to combine requests to the same or contiguous memory locations, and by applying classical optimizations to reduce the number of executed operations. We apply these strategies across a variety of applications and domains and achieve speedups of 10.5X to 457X in kernel code and 1.16X to 431X for the total application.
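
The balance between per-thread resource usage and the number of simultaneously active threads can be illustrated with a small host-side calculation. The sketch below is not taken from the paper: the names `SmLimits`, `KernelUsage`, and `resident_threads` are hypothetical, and the per-multiprocessor limits are supplied by the caller as assumptions. It also ignores allocation granularity, which real hardware applies to registers and on-chip memory, so it should be read as a rough model only.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical description of one multiprocessor's capacity.
// The values are supplied by the caller; none come from the abstract.
struct SmLimits {
    int registers;     // registers available per multiprocessor
    int shared_bytes;  // on-chip shared memory per multiprocessor, in bytes
    int max_threads;   // maximum resident threads per multiprocessor
    int max_blocks;    // maximum resident thread blocks per multiprocessor
};

// Hypothetical description of one kernel's resource demands.
struct KernelUsage {
    int threads_per_block;
    int regs_per_thread;
    int shared_bytes_per_block;  // 0 if the kernel uses no shared memory
};

// Returns how many threads can be resident on one multiprocessor at once.
// Whole thread blocks are scheduled, so each resource caps the block count
// and the tightest cap wins.
int resident_threads(const SmLimits& sm, const KernelUsage& k) {
    assert(k.threads_per_block > 0 && k.regs_per_thread > 0);

    int by_regs = sm.registers / (k.regs_per_thread * k.threads_per_block);
    int by_threads = sm.max_threads / k.threads_per_block;
    int by_shared = (k.shared_bytes_per_block > 0)
                        ? sm.shared_bytes / k.shared_bytes_per_block
                        : sm.max_blocks;  // shared memory is not a constraint

    int blocks = std::min({by_regs, by_threads, by_shared, sm.max_blocks});
    return blocks * k.threads_per_block;
}
```

Under assumed limits of 8192 registers, 16 KB of shared memory, 768 threads, and 8 blocks per multiprocessor, a kernel with 256 threads per block and 10 registers per thread fits three blocks (768 threads). Raising the usage to 11 registers per thread leaves room for only two blocks (512 threads), so a one-register increase removes a third of the threads available to hide memory latency.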