Iterative stencil loops (ISLs) are used in many applications, and tiling is a well-known technique for localizing their computation. When ISLs are tiled across a parallel architecture, there are usually halo regions that need to be updated and exchanged among different processing elements (PEs). In addition, synchronization is often used to signal the completion of halo exchanges. Both communication and synchronization may incur significant overhead on parallel architectures with shared memory. This is especially true in the case of graphics processing units (GPUs), which do not preserve the state of the per-core L1 storage across global synchronizations. To reduce these overheads, ghost zones can be created to replicate stencil operations, reducing communication and synchronization costs at the expense of redundantly computing some values on multiple PEs. However, the selection of the optimal ghost zone size depends on the characteristics of both the architecture and the application, and it has only been studied for message-passing systems in a grid environment. To automate this process on shared-memory systems, we establish a performance model using NVIDIA's Tesla architecture as a case study and propose a framework that uses the performance model to automatically select the ghost zone size that performs best and to generate the corresponding code. The modeling is validated by four diverse ISL applications, for which the predicted ghost zone configurations achieve a speedup no less than 98% of the optimal speedup.
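The ghost-zone trade-off described above can be illustrated with a minimal 1-D sketch in NumPy. This is not the paper's CUDA framework or performance model; the function names, the 3-point averaging stencil, and all parameter values here are illustrative. Each tile reads `ghost` extra cells per side and recomputes them redundantly, so it can advance `ghost` time steps before the next data exchange, which stands in for the halo exchange plus global synchronization a GPU kernel would otherwise need at every step:

```python
import numpy as np

def jacobi_step(a):
    """One 3-point averaging step; the two end cells are held fixed."""
    out = a.copy()
    out[1:-1] = (a[:-2] + a[1:-1] + a[2:]) / 3.0
    return out

def reference(a, steps):
    """Straightforward iteration: one (implicit) exchange per step."""
    for _ in range(steps):
        a = jacobi_step(a)
    return a

def tiled_with_ghost_zones(a, steps, tile, ghost):
    """Tiles exchange data only every `ghost` steps, trading redundant
    computation in the ghost zone for fewer synchronizations."""
    n = len(a)
    assert steps % ghost == 0
    for _ in range(steps // ghost):
        new = np.empty_like(a)
        for start in range(0, n, tile):
            stop = min(start + tile, n)
            lo, hi = max(0, start - ghost), min(n, stop + ghost)
            block = a[lo:hi].copy()        # tile plus ghost zone
            for _ in range(ghost):         # advance locally, no exchange
                block = jacobi_step(block)
            # only the tile's own cells are still valid after `ghost` steps
            new[start:stop] = block[start - lo : stop - lo]
        a = new                            # one exchange per `ghost` steps
    return a

grid = np.sin(np.linspace(0.0, 3.0, 64))
exact = reference(grid, 8)
tiled = tiled_with_ghost_zones(grid, 8, tile=16, ghost=4)
assert np.allclose(exact, tiled)
```

The key invariant is that after `g` local steps, only cells at least `g` cells away from a tile's (interior) edge still hold correct values, which is exactly why the ghost zone must be `g` cells wide; enlarging `g` reduces exchanges linearly but grows the redundant work, and the paper's performance model is what arbitrates that trade-off on a real GPU.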