Mint: realizing CUDA performance in 3D stencil methods with annotated C

Authors:
Didem Unat;Xing Cai;Scott B. Baden
Affiliations:
University of California, San Diego, La Jolla, CA, USA;Simula Research Laboratory, Oslo, Norway;University of California, San Diego, La Jolla, CA, USA
Venue:
Proceedings of the international conference on Supercomputing
Year:
2011

Citing 19
Cited 19

Tiling optimizations for 3D scientific computations

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Treating a User-Defined Parallel Library as a Domain-Specific Language

IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
The FFT on a GPU

Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware
Finite Differences And Partial Differential Equations

Finite Differences And Partial Differential Equations
Digital Image Processing (3rd Edition)

Digital Image Processing (3rd Edition)
Using advanced compiler technology to exploit the performance of the Cell Broadband EngineTM architecture

IBM Systems Journal
Introduction to the cell multiprocessor

IBM Journal of Research and Development - POWER5 and packaging
Scientific computing Kernels on the cell processor

International Journal of Parallel Programming
Using OpenMP: Portable Shared Memory Parallel Programming (Scientific and Engineering Computation)

Using OpenMP: Portable Shared Memory Parallel Programming (Scientific and Engineering Computation)
Scalable parallel programming with CUDA

ACM SIGGRAPH 2008 classes
Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Benchmarking GPUs to tune dense linear algebra

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
OpenMP to GPGPU: a compiler framework for automatic translation and optimization

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
3D finite difference computation on GPUs using CUDA

Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units
A cross-input adaptive framework for GPU program optimizations

IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
Implementing sparse matrix-vector multiplication on throughput-oriented processors

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Implementing the PGI Accelerator model

Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
OpenMPC: Extended OpenMP Programming and Tuning for GPUs

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Source-to-source optimization of CUDA C for GPU accelerated cardiac cell modeling

EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I

Physis: an implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
GROPHECY: GPU performance projection from CPU code skeletons

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Chestnut: a GPU programming language for non-experts

Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores
Introducing 'Bones': a parallelizing source-to-source compiler based on algorithmic skeletons

Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units
Auto-generation and auto-tuning of 3D stencil codes on GPU clusters

Proceedings of the Tenth International Symposium on Code Generation and Optimization
A systematic process for efficient execution on Intel's heterogeneous computation nodes

Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment: Bridging from the eXtreme to the campus and beyond
Stencil computations on heterogeneous platforms for the Jacobi method: GPUs versus Cell BE

The Journal of Supercomputing
Patus for convenient high-performance stencils: evaluation in earthquake simulations

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Dataflow-driven GPU performance projection for multi-kernel transformations

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
PARTANS: An autotuning framework for stencil computation on multi-GPU systems

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
From physics model to results: an optimizing framework for cross-architecture code generation

Proceedings of the Extreme Scaling Workshop
Implementing OmpSs support for regions of data in architectures with multiple address spaces

Proceedings of the 27th international ACM conference on International conference on supercomputing
Abstractions to separate concerns in semi-regular grids

Proceedings of the 27th international ACM conference on International conference on supercomputing
A software-based dynamic-warp scheduling approach for load-balancing the Viola-Jones face detection algorithm on GPUs

Journal of Parallel and Distributed Computing
Efficient Mapping of Irregular C++ Applications to Integrated GPUs

Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
APR: A Novel Parallel Repacking Algorithm for Efficient GPGPU Parallel Code Transformation

Proceedings of Workshop on General Purpose Processing Using GPUs
Accelerating Single Iteration Performance of CUDA-Based 3D Reaction---Diffusion Simulations

International Journal of Parallel Programming
From physics model to results: An optimizing framework for cross-architecture code generation

Scientific Programming
CaKernel --A parallel application programming framework for heterogenous computing architectures

Scientific Programming

Quantified Score

Hi-index	0.00

Visualization

Abstract

We present Mint, a programming model that enables the non-expert to enjoy the performance benefits of hand coded CUDA without becoming entangled in the details. Mint targets stencil methods, which are an important class of scientific applications. We have implemented the Mint programming model with a source-to-source translator that generates optimized CUDA C from traditional C source. The translator relies on annotations to guide translation at a high level. The set of pragmas is small, and the model is compact and simple. Yet, Mint is able to deliver performance competitive with painstakingly hand-optimized CUDA. We show that, for a set of widely used stencil kernels, Mint realized 80% of the performance obtained from aggressively optimized CUDA on the 200 series NVIDIA GPUs. Our optimizations target three dimensional kernels, which present a daunting array of optimizations.