Effective automatic parallelization of stencil computations

Authors:
Sriram Krishnamoorthy;Muthu Baskaran;Uday Bondhugula;J. Ramanujam;Atanas Rountev;P. Sadayappan
Affiliations:
The Ohio State University, Columbus, OH;The Ohio State University, Columbus, OH;The Ohio State University, Columbus, OH;Louisiana State University, Baton Rouge, LA;The Ohio State University, Columbus, OH;The Ohio State University, Columbus, OH
Venue:
Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Year:
2007

Abstract

Performance optimization of stencil computations has been widely studied in the literature, since they occur in many computationally intensive scientific and engineering applications. Compiler frameworks have also been developed that can transform sequential stencil codes to optimize data locality and parallelism. However, loop skewing is typically required in order to tile stencil codes along the time dimension, resulting in load imbalance in pipelined parallel execution of the tiles. In this paper, we develop an approach for automatic parallelization of stencil codes that explicitly addresses the issue of load-balanced execution of tiles. Experimental results are provided that demonstrate the effectiveness of the approach.
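
The load-imbalance problem named in the abstract can be illustrated with a minimal sketch. It is not the approach developed in the paper; it only shows the conventional skew-then-tile scheme that the abstract identifies as the source of imbalance. The 3-point 1D Jacobi stencil, the function names, and the tile sizes are all assumptions chosen for illustration. Skewing the space index by the time index (j = i + t) makes every dependence non-negative, so rectangular tiles in (t, j) are legal, and tiles on the same anti-diagonal can run concurrently. The sketch counts the tiles on each such wavefront.

```python
def jacobi_reference(initial, steps):
    """Plain time-stepped 3-point Jacobi stencil with fixed boundary values."""
    prev = list(initial)
    n = len(prev)
    for _ in range(steps):
        nxt = list(prev)
        for i in range(1, n - 1):
            nxt[i] = (prev[i - 1] + prev[i] + prev[i + 1]) / 3.0
        prev = nxt
    return prev


def jacobi_skewed_tiled(initial, steps, tile_t, tile_j):
    """Same stencil, iterated over rectangular tiles of the skewed space (t, j = i + t).

    Returns the final grid and the number of non-empty tiles on each wavefront.
    A full space-time array is kept for simplicity (an illustrative choice).
    """
    n = len(initial)
    # a[t][i] holds the value at time t; boundary columns stay at their initial values.
    a = [list(initial) for _ in range(steps + 1)]
    # Largest skewed index: i = n - 2 at t = steps - 1.
    j_max = (n - 2) + (steps - 1)
    tiles_per_wavefront = {}
    for tt in range(0, steps, tile_t):
        for jj in range(1, j_max + 1, tile_j):
            work = 0
            for t in range(tt, min(tt + tile_t, steps)):
                for j in range(jj, min(jj + tile_j, j_max + 1)):
                    i = j - t  # undo the skew to recover the space index
                    if 1 <= i <= n - 2:
                        a[t + 1][i] = (a[t][i - 1] + a[t][i] + a[t][i + 1]) / 3.0
                        work += 1
            if work:
                # Tiles with equal (time-tile index + space-tile index) are independent.
                w = tt // tile_t + (jj - 1) // tile_j
                tiles_per_wavefront[w] = tiles_per_wavefront.get(w, 0) + 1
    counts = [tiles_per_wavefront[w] for w in sorted(tiles_per_wavefront)]
    return a[steps], counts


if __name__ == "__main__":
    grid = [float(i % 5) for i in range(20)]
    result, counts = jacobi_skewed_tiled(grid, steps=8, tile_t=4, tile_j=4)
    print("matches reference:", result == jacobi_reference(grid, 8))
    print("tiles per wavefront:", counts)
```

The number of tiles per wavefront grows during pipeline start-up and shrinks during the drain, so processors sit idle at both ends of the execution. This is the imbalance that the paper sets out to address.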