Data layout transformation exploiting memory-level parallelism in structured grid many-core applications

Authors:
I-Jui Sung; John A. Stratton; Wen-Mei W. Hwu
Affiliations:
Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign, Urbana, IL, USA (all three authors)
Venue:
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Year:
2010

Abstract

We present automatic data layout transformation as an effective compiler performance optimization for memory-bound structured grid applications. Structured grid applications include stencil codes and other code structures that use a dense, regular grid as the primary data structure. Fluid dynamics and heat distribution codes, which both solve partial differential equations on a discretized representation of space, are representative of many important structured grid applications. Using the information available through variable-length array syntax, standardized in C99 and other modern languages, we have enabled automatic data layout transformations for structured grid codes with dynamically allocated arrays. We also show how a tool can guide these transformations to statically choose a good layout given a model of the memory system, using a modern GPU as an example. A transformed layout that distributes concurrent memory requests among parallel memory system components provides substantial speedup for structured grid applications by improving their achieved memory-level parallelism. Even with the overhead of more complex address calculations, we observe performance increases of up to 560% over the language-defined layout, and a 7% performance gain in the worst case, in which the language-defined layout and access pattern are already well vectorizable by the underlying hardware.
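
To make the idea of a data layout transformation concrete, the following is a minimal C99 sketch and not the authors' tool or the layout they selected. It assumes a hypothetical grid with `ne` values per cell, declared logically as `grid[ny][nx][ne]`, and contrasts the language-defined index with one possible transformed index that splits `x` into tiles and moves the within-tile offset innermost. All function names, the tile width, and the choice of this particular split-and-permute layout are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Variable-length array syntax: the parameter declaration carries the
   grid shape, which is the information a compiler needs to remap it. */
static float read_default(size_t ny, size_t nx, size_t ne,
                          float grid[ny][nx][ne],
                          size_t y, size_t x, size_t e)
{
    return grid[y][x][e];
}

/* Language-defined (row-major) offset for logical grid[y][x][e]:
   e varies fastest, so neighbouring x are ne elements apart. */
static size_t index_default(size_t nx, size_t ne,
                            size_t y, size_t x, size_t e)
{
    return (y * nx + x) * ne + e;
}

/* Illustrative transformed offset: logical [y][x][e] is stored as
   [y][x / tile][e][x % tile], so neighbouring x inside one tile are
   adjacent in memory for a fixed e. Assumes tile divides nx. */
static size_t index_tiled(size_t nx, size_t ne, size_t tile,
                          size_t y, size_t x, size_t e)
{
    assert(tile > 0 && nx % tile == 0);
    size_t xt = x / tile;   /* which tile along x */
    size_t xo = x % tile;   /* offset inside the tile */
    return ((y * (nx / tile) + xt) * ne + e) * tile + xo;
}

/* Returns 1 if the tiled mapping hits every offset in
   [0, ny*nx*ne) exactly once, i.e. it is a valid relayout. */
static int tiled_is_bijective(size_t ny, size_t nx, size_t ne, size_t tile)
{
    size_t total = ny * nx * ne;
    unsigned char *seen = calloc(total, 1);
    int ok = 1;
    if (!seen) return 0;
    for (size_t y = 0; y < ny; ++y)
        for (size_t x = 0; x < nx; ++x)
            for (size_t e = 0; e < ne; ++e) {
                size_t i = index_tiled(nx, ne, tile, y, x, e);
                if (i >= total || seen[i]) ok = 0;
                else seen[i] = 1;
            }
    free(seen);
    return ok;
}
```

With `nx = 8`, `ne = 3`, and `tile = 4`, stepping `x` from 0 to 1 moves the default offset by 3 elements but the tiled offset by 1. The tiled form pays for this with the extra division and modulo, which is the kind of address-calculation overhead the abstract mentions. Which split and permutation best spreads requests across memory channels and banks depends on the memory-system model, which this sketch does not attempt to capture.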