We propose a new set of automated techniques to optimize memory reuse in programs with explicitly managed memory. Our techniques are inspired by hand-tuned seismic kernels on GPUs. The solutions we develop reduce the cost of transferring data across multiple memories with different bandwidth, latency and addressability properties. For out-of-place stencils, they reduce the volume of communication with main memory and yield execution speeds comparable to hand-tuned implementations. We discuss the steps of our source-to-source compiler infrastructure and focus on four optimizations: flexible generation of communications at different granularities with respect to computations; reduction of redundant transfers; reuse of data across processing elements through a globally addressable local memory; and reuse of data within a processing element through a local private memory. The memory models we consider support the GPU model with device, shared and register memories. The techniques we derive are generally applicable, and their formulation within our compiler can be extended to other types of architectures.
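To make the claim about communication volume concrete, the following is a minimal sketch, not the paper's compiler output. It models a 1D out-of-place stencil in Python and counts reads from "main memory" with and without staging each tile (plus its halo) into a local buffer that stands in for a globally addressable local memory such as GPU shared memory. The function names, the 1D stencil shape, and the `radius` and `tile` parameters are all assumptions introduced for illustration; register-level reuse is not modelled.

```python
def stencil_naive(src, radius):
    """Out-of-place stencil: every operand is read from 'main memory'."""
    n = len(src)
    dst = [0.0] * n
    loads = 0  # number of main-memory reads
    for i in range(radius, n - radius):
        acc = 0.0
        for k in range(-radius, radius + 1):
            acc += src[i + k]
            loads += 1
        dst[i] = acc
    return dst, loads


def stencil_tiled(src, radius, tile):
    """Same stencil, but each tile is first copied into a local buffer.

    The local buffer is a hypothetical stand-in for a fast local memory;
    only the bulk copy counts as main-memory traffic.
    """
    n = len(src)
    dst = [0.0] * n
    loads = 0
    for start in range(radius, n - radius, tile):
        stop = min(start + tile, n - radius)
        # One bulk transfer: the tile plus a halo of `radius` on each side.
        local = src[start - radius:stop + radius]
        loads += len(local)
        for i in range(start, stop):
            j = i - start + radius  # index of point i inside the local buffer
            acc = 0.0
            for k in range(-radius, radius + 1):
                acc += local[j + k]  # served from local memory, not counted
            dst[i] = acc
    return dst, loads


if __name__ == "__main__":
    data = [float(i % 7) for i in range(34)]
    ref, naive_loads = stencil_naive(data, radius=1)
    out, tiled_loads = stencil_tiled(data, radius=1, tile=8)
    print(out == ref, naive_loads, tiled_loads)  # True 96 40
```

In this toy model the naive version issues `2 * radius + 1` main-memory reads per output point, while the tiled version issues roughly `tile + 2 * radius` reads per tile, so the saving grows with the stencil radius and the tile size. A real GPU implementation would also have to account for coalescing, synchronization, and local memory capacity, none of which appear here.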