Data layout transformation for stencil computations on short-vector SIMD architectures

Authors:
Tom Henretty;Kevin Stock;Louis-Noël Pouchet;Franz Franchetti;J. Ramanujam;P. Sadayappan
Affiliations:
The Ohio State University;The Ohio State University;The Ohio State University;Carnegie Mellon University;Louisiana State University;The Ohio State University
Venue:
CC'11/ETAPS'11 Proceedings of the 20th international conference on Compiler construction: part of the joint European conferences on theory and practice of software
Year:
2011

Citing 35
Cited 7

Automatic translation of FORTRAN programs to vector form

ACM Transactions on Programming Languages and Systems (TOPLAS)
Communication optimization and code generation for distributed memory machines

PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
Automatic array alignment in data-parallel programs

POPL '93 Proceedings of the 20th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Data and computation transformations for multiprocessors

PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Data transformations for eliminating conflict misses

PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Automatic data layout for distributed-memory machines

ACM Transactions on Programming Languages and Systems (TOPLAS)
A Linear Algebra Framework for Automatic Determination of Optimal Data Layouts

IEEE Transactions on Parallel and Distributed Systems
Nonsingular Data Transformations: Definition, Validity, and Applications

International Journal of Parallel Programming
Exploiting superword level parallelism with multimedia instruction sets

PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Optimizing compilers for modern architectures: a dependence-based approach

Optimizing compilers for modern architectures: a dependence-based approach
High Performance Compilers for Parallel Computing

High Performance Compilers for Parallel Computing
Achieving Scalable Locality with Time Skewing

International Journal of Parallel Programming
Increasing and Detecting Memory Address Congruence

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Vectorization for SIMD architectures with alignment constraints

Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation
Automatic tiling of iterative stencil loops

ACM Transactions on Programming Languages and Systems (TOPLAS)
Efficient SIMD Code Generation for Runtime Alignment and Length Conversion

Proceedings of the international symposium on Code generation and optimization
Impact of modern memory subsystems on cache optimizations for stencil computations

Proceedings of the 2005 workshop on Memory system performance
Multi-platform Auto-vectorization

Proceedings of the International Symposium on Code Generation and Optimization
Auto-vectorization of interleaved data for SIMD

Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Implicit and explicit optimizations for stencil computations

Proceedings of the 2006 workshop on Memory system performance and correctness
Sketching stencils

Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Effective automatic parallelization of stencil computations

Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Outer-loop vectorization: revisited for short SIMD architectures

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
A SIMD optimization framework for retargetable compilers

ACM Transactions on Architecture and Code Optimization (TACO)
3D finite difference computation on GPUs using CUDA

Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units
Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems

Proceedings of the 23rd international conference on Supercomputing
Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs

Proceedings of the 23rd international conference on Supercomputing
A Multilevel Parallelization Framework for High-Order Stencil Computations

Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Optimized Stencil Computation Using In-Place Calculation on Modern Multicore Systems

Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Efficient Temporal Blocking for Stencil Computations by Multicore-Aware Wavefront Parallelization

COMPSAC '09 Proceedings of the 2009 33rd Annual IEEE International Computer Software and Applications Conference - Volume 01
Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors

SIAM Review
Mapping the FDTD Application to Many-Core Chip Architectures

ICPP '09 Proceedings of the 2009 International Conference on Parallel Processing
Data transformations enabling loop vectorization on multithreaded data parallel architectures

Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
New algorithms for SIMD alignment

CC'07 Proceedings of the 16th international conference on Compiler construction

A compiler framework for extracting superword level parallelism

Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Tiling stencil computations to maximize parallelism

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
When polyhedral transformations meet SIMD code generation

Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation
A stencil compiler for short-vector SIMD architectures

Proceedings of the 27th international ACM conference on International conference on supercomputing
Optimizing the performance of streaming numerical kernels on the IBM Blue Gene/P PowerPC 450 processor

International Journal of High Performance Computing Applications
Semantics-preserving data layout transformations for improved vectorisation

Proceedings of the 2nd ACM SIGPLAN workshop on Functional high-performance computing
Compiling Scilab to high performance embedded multicore systems

Microprocessors & Microsystems

Quantified Score

Hi-index	0.00

Visualization

Abstract

Stencil computations are at the core of applications in many domains such as computational electromagnetics, image processing, and partial differential equation solvers used in a variety of scientific and engineering applications. Short-vector SIMD instruction sets such as SSE and VMX provide a promising and widely available avenue for enhancing performance on modern processors. However a fundamental memory stream alignment issue limits achieved performance with stencil computations on modern short SIMD architectures. In this paper, we propose a novel data layout transformation that avoids the stream alignment conflict, along with a static analysis technique for determining where this transformation is applicable. Significant performance increases are demonstrated for a variety of stencil codes on three modern SIMD-capable processors.