A performance study of general-purpose applications on graphics processors using CUDA

Authors:
Shuai Che; Michael Boyer; Jiayuan Meng; David Tarjan; Jeremy W. Sheaffer; Kevin Skadron
Affiliations:
University of Virginia, Department of Computer Science, Charlottesville, VA, USA (all authors)
Venue:
Journal of Parallel and Distributed Computing
Year:
2008

Abstract

Graphics processors (GPUs) provide a vast number of simple, data-parallel, deeply multithreaded cores and high memory bandwidths. GPU architectures are becoming increasingly programmable, offering the potential for dramatic speedups for a variety of general-purpose applications compared to contemporary general-purpose processors (CPUs). This paper uses NVIDIA's C-like CUDA language and an engineering sample of their recently introduced GTX 260 GPU to explore the effectiveness of GPUs for a variety of application types, and describes some specific coding idioms that improve their performance on the GPU. GPU performance is compared to both single-core and multicore CPU performance, with multicore CPU implementations written using OpenMP. The paper also discusses advantages and inefficiencies of the CUDA programming model and some desirable features that might allow for greater ease of use and also more readily support a larger body of applications.