Memory reuse optimizations in the R-Stream compiler

Authors:
Nicolas Vasilache;Muthu Baskaran;Benoit Meister;Richard Lethin
Affiliations:
Reservoir Labs Inc., New York, NY;Reservoir Labs Inc., New York, NY;Reservoir Labs Inc., New York, NY;Reservoir Labs Inc., New York, NY
Venue:
Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
Year:
2013

Abstract

We propose a new set of automated techniques to optimize memory reuse in programs with explicitly managed memory. Our techniques are inspired by hand-tuned seismic kernels on GPUs. The solutions we develop reduce the cost of transferring data across multiple memories with different bandwidth, latency, and addressability properties. For out-of-place stencils, they reduce the volume of communication with main memory and give faster execution, with speeds comparable to hand-tuned implementations. We discuss the steps of our source-to-source compiler infrastructure and focus on four optimizations: flexible generation of communications at different granularities with respect to computations; reduction of redundant transfers; reuse of data across processing elements through a globally addressable local memory; and reuse of data within a processing element through a local private memory. The memory models we consider support the GPU model with device, shared, and register memories. The techniques we derive are generally applicable, and their formulation within our compiler can be extended to other types of architectures.
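
To make the reuse idea concrete, the following is a minimal sketch in Python, not code generated by R-Stream. It models a 1D out-of-place three-point stencil and counts reads from "main memory" in two variants: one that reads every operand directly, and one that stages each tile plus its halo into a local buffer once (standing in for shared memory) and then slides a small window of scalars over it (standing in for registers). The function names, the tile size, and the load-counting model are all assumptions made for illustration.

```python
def stencil_naive(a):
    """Out-of-place 3-point average; every operand is read from main memory."""
    n = len(a)
    out = [0.0] * n
    loads = 0  # modeled main-memory reads (assumption: one per operand)
    for i in range(1, n - 1):
        out[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
        loads += 3
    return out, loads


def stencil_tiled(a, tile=4):
    """Same stencil, with tile-level staging and a sliding scalar window.

    `tile` is a hypothetical tuning parameter, not a value from the paper.
    """
    n = len(a)
    out = [0.0] * n
    loads = 0
    for start in range(1, n - 1, tile):
        end = min(start + tile, n - 1)
        # One bulk transfer per tile: interior points plus a one-element halo
        # on each side. This buffer plays the role of shared memory.
        local = a[start - 1:end + 1]
        loads += len(local)
        # Sliding window of scalars, playing the role of registers: each
        # output reuses two values from the previous iteration.
        left, mid = local[0], local[1]
        for k in range(start, end):
            right = local[k - start + 2]
            out[k] = (left + mid + right) / 3.0
            left, mid = mid, right
    return out, loads


if __name__ == "__main__":
    data = [float(i * i) for i in range(10)]
    ref, naive_loads = stencil_naive(data)
    res, tiled_loads = stencil_tiled(data, tile=4)
    print(res == ref, naive_loads, tiled_loads)
```

In this model the only redundant main-memory reads left in the tiled variant are the halo elements shared by neighboring tiles, so larger tiles amortize them further, at the cost of a larger local buffer.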