Split tiling for GPUs: automatic parallelization using trapezoidal tiles

Authors:
Tobias Grosser; Albert Cohen; Paul H. J. Kelly; J. Ramanujam; P. Sadayappan; Sven Verdoolaege
Affiliations:
École Normale Supérieure; École Normale Supérieure; Imperial College London; Louisiana State University; Ohio State University; École Normale Supérieure
Venue:
Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
Year:
2013

Abstract

Tiling is a key technique to enhance data reuse. For computations structured as one sequential outer "time" loop enclosing a set of parallel inner loops, tiling only the parallel inner loops may not enable enough data reuse in the cache. Tiling the inner loops along with the outer time loop enhances data locality but may require other transformations like loop skewing that inhibit inter-tile parallelism. One approach to tiling that enhances data locality without inhibiting inter-tile parallelism is split tiling, where tiles are subdivided into a sequence of trapezoidal computation steps. In this paper, we develop an approach to generate split tiled code for GPUs in the PPCG polyhedral code generator. We propose a generic algorithm to calculate index-set splitting that enables us to perform tiling for locality and synchronization avoidance, while simultaneously maintaining parallelism, without the need for skewing or redundant computations. Our algorithm performs split tiling for an arbitrary number of dimensions and without the need to construct any large integer linear program. The method and its implementation are evaluated on standard stencil kernels and compared with a state-of-the-art polyhedral compiler and with a domain-specific stencil compiler, both targeting CUDA GPUs.
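
The two-phase structure of split tiling described in the abstract can be illustrated with a minimal sketch. This is not the PPCG implementation or the paper's index-set splitting algorithm; it is a hand-written 1D example under stated assumptions: a three-point Jacobi stencil with fixed boundary values, a grid length that is a multiple of the tile width, and a tile width of at least twice the time-band height. The names `jacobi_split_tiled`, `width`, and `height` are hypothetical. For clarity the sketch stores the full space-time grid, which generated GPU code would not do.

```python
def jacobi_reference(u0, steps):
    """Plain time loop: every step sweeps the whole interior."""
    cur = list(u0)
    for _ in range(steps):
        nxt = list(cur)
        for i in range(1, len(cur) - 1):
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0
        cur = nxt
    return cur


def jacobi_split_tiled(u0, steps, width, height):
    """Same stencil, executed in time bands of two trapezoidal phases."""
    n = len(u0)
    # Assumptions of this sketch, not requirements stated in the paper.
    assert n % width == 0 and width >= 2 * height >= 2

    # grid[t][i] holds the value of point i after t steps. Interior cells
    # start as None so that reading a not-yet-computed value fails loudly.
    grid = [[None] * n for _ in range(steps + 1)]
    grid[0] = list(u0)
    for t in range(1, steps + 1):
        grid[t][0], grid[t][n - 1] = u0[0], u0[n - 1]  # fixed boundaries

    def update(t, lo, hi):
        # Compute grid[t + 1][i] for i in [lo, hi), clipped to the interior.
        for i in range(max(lo, 1), min(hi, n - 1)):
            grid[t + 1][i] = (grid[t][i - 1] + grid[t][i] + grid[t][i + 1]) / 3.0

    for t0 in range(0, steps, height):
        h = min(height, steps - t0)
        # Phase 1: upright trapezoids. Each tile shrinks by one point per
        # side per step, so it reads only values from its own tile.
        # The iterations of the `lo` loop are independent of each other.
        for lo in range(0, n, width):
            for s in range(h):
                update(t0 + s, lo + s, lo + width - s)
        # Phase 2: inverted trapezoids centred on tile boundaries fill the
        # gaps left by phase 1. Each reads phase-1 results and its own
        # earlier rows, so the `b` iterations are again independent.
        for b in range(0, n + 1, width):
            for s in range(1, h):
                update(t0 + s, b - s, b + s)
    return grid[steps]


u0 = [float((i * 7) % 5) for i in range(32)]
ref = jacobi_reference(u0, 7)
out = jacobi_split_tiled(u0, 7, width=8, height=3)
print(all(abs(a - b) < 1e-12 for a, b in zip(ref, out)))
```

Within a band, all phase-1 tiles can run in parallel, followed by one synchronization and then all phase-2 tiles in parallel. No point is computed twice and no loop skewing is needed, which is the property the abstract attributes to split tiling.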