Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories

Authors:
Muthu Manikandan Baskaran;Uday Bondhugula;Sriram Krishnamoorthy;J. Ramanujam;Atanas Rountev;P. Sadayappan
Affiliations:
The Ohio State University, Columbus, OH, USA;The Ohio State University, Columbus, OH, USA;The Ohio State University, Columbus, OH, USA;Louisiana State University, Baton Rouge, LA, USA;The Ohio State University, Columbus, OH, USA;The Ohio State University, Columbus, OH, USA
Venue:
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Year:
2008

Citing 28
Cited 34

Strategies for cache and local memory management by global program transformation

Proceedings of the 1st International Conference on Supercomputing
A practical algorithm for exact array dependence analysis

Communications of the ACM
Some efficient solutions to the affine scheduling problem: I. One-dimensional time

International Journal of Parallel Programming
Counting solutions to Presburger formulas: how and why

PLDI '94 Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation
Counting solutions to linear and nonlinear constraints through Ehrhart polynomials: applications to analyze and transform scientific programs

ICS '96 Proceedings of the 10th international conference on Supercomputing
Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology

ICS '97 Proceedings of the 11th international conference on Supercomputing
Maximizing parallelism and minimizing synchronization with affine transforms

Proceedings of the 24th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Optimal fine and medium grain parallelism detection in polyhedral reduced dependence graphs

International Journal of Parallel Programming
Exact memory size estimation for array computations without loop unrolling

Proceedings of the 36th annual ACM/IEEE Design Automation Conference
Data and memory optimization techniques for embedded systems

ACM Transactions on Design Automation of Electronic Systems (TODAES)
Reducing memory requirements of nested loops for embedded systems

Proceedings of the 38th annual Design Automation Conference
Storage Management Programmable Process

Storage Management Programmable Process
On Estimating and Enhancing Cache Effectiveness

Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing
Automatic Parallelization in the Polytope Model

The Data Parallel Programming Model: Foundations, HPF Realization, and Scientific Applications
Compiler Optimizations for Real Time Execution of Loops on Limited Memory Embedded Systems

RTSS '98 Proceedings of the IEEE Real-Time Systems Symposium
Improving parallelism and data locality with affine partitioning

Improving parallelism and data locality with affine partitioning
Programming for parallelism and locality with hierarchically tiled arrays

Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Multiprocessor system-on-chip data reuse analysis for exploring customized memory hierarchies

Proceedings of the 43rd annual Design Automation Conference
Loop Transformation Methodologies for Array-Oriented Memory Management

ASAP '06 Proceedings of the IEEE 17th International Conference on Application-specific Systems, Architectures and Processors
Violated dependence analysis

Proceedings of the 20th annual international conference on Supercomputing
Sequoia: programming the memory hierarchy

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Sequoia: programming the memory hierarchy

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Effective automatic parallelization of stencil computations

Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Iterative Optimization in the Polyhedral Model: Part I, One-Dimensional Time

Proceedings of the International Symposium on Code Generation and Optimization
Multi-level tiling: M for the price of one

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Compiler-directed scratch pad memory optimization for embedded multiprocessors

IEEE Transactions on Very Large Scale Integration (VLSI) Systems - Special section on the 2002 international symposium on low-power electronics and design (ISLPED)
A compiler-based approach for dynamically managing scratch-pad memories in embedded systems

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
A cost-effective implementation of multilevel tiling

IEEE Transactions on Parallel and Distributed Systems

A compiler framework for optimization of affine loop nests for gpgpus

Proceedings of the 22nd annual international conference on Supercomputing
Program optimization carving for GPU computing

Journal of Parallel and Distributed Computing
Positivity, posynomials and tile size selection

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
A tuning framework for software-managed memory hierarchies

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
CUDA-Lite: Reducing GPU Programming Complexity

Languages and Compilers for Parallel Computing
Precise Management of Scratchpad Memories for Localising Array Accesses in Scientific Codes

CC '09 Proceedings of the 18th International Conference on Compiler Construction: Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2009
Efficient Mapping of Multiresolution Image Filtering Algorithms on Graphics Processors

SAMOS '09 Proceedings of the 9th International Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation
Optimizing shared cache behavior of chip multiprocessors

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Data transformations enabling loop vectorization on multithreaded data parallel architectures

Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
A GPGPU compiler for memory optimization and parallelism management

PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Strider: Runtime Support for Optimizing Strided Data Accesses on Multi-Cores with Explicitly Managed Memories

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Adding data-movement constructs to the PGAS parallel computing model

Proceedings of the 7th European Lisp Workshop
Compiler-directed memory management for heterogeneous MPSoCs

Journal of Systems Architecture: the EUROMICRO Journal
Landing stencil code on Godson-T

Journal of Computer Science and Technology
Practical loop transformations for tensor contraction expressions on multi-level memory hierarchies

CC'11/ETAPS'11 Proceedings of the 20th international conference on Compiler construction: part of the joint European conferences on theory and practice of software
Fault oblivious eXascale whitepaper

Proceedings of the 1st International Workshop on Runtime and Operating Systems for Supercomputers
Automatic compilation of MATLAB programs for synergistic execution on heterogeneous processors

Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
An execution strategy and optimized runtime support for parallelizing irregular reductions on modern GPUs

Proceedings of the international conference on Supercomputing
A reuse-aware prefetching scheme for scratchpad memory

Proceedings of the 48th Design Automation Conference
Combined loop transformation and hierarchy allocation for data reuse optimization

Proceedings of the International Conference on Computer-Aided Design
Identifying hotspots in a program for data parallel architecture: an early experience

Proceedings of the 5th India Software Engineering Conference
Optimizing remote accesses for offloaded kernels: application to high-level synthesis for FPGA

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
TL-DAE: thread-level decoupled access/execution for OpenMP on the cyclops-64 many-core processor

LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Automatic C-to-CUDA code generation for affine programs

CC'10/ETAPS'10 Proceedings of the 19th joint European conference on Theory and Practice of Software, international conference on Compiler Construction
Parameterized micro-benchmarking: an auto-tuning approach for complex applications

Proceedings of the 9th conference on Computing Frontiers
Optimizing memory hierarchy allocation with loop transformations for high-level synthesis

Proceedings of the 49th Annual Design Automation Conference
A survey on hardware-aware and heterogeneous computing on multicore processors and accelerators

Concurrency and Computation: Practice & Experience
Optimizing MapReduce for GPUs with effective shared memory usage

Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing
Fast and efficient automatic memory management for GPUs using compiler-assisted runtime coherence scheme

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
A script-based autotuning compiler system to generate high-performance CUDA code

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Polyhedral-based data reuse optimization for configurable computing

Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays
Optimizing remote accesses for offloaded kernels: application to high-level synthesis for FPGA

Proceedings of the Conference on Design, Automation and Test in Europe
Compiling affine loop nests for distributed-memory parallel architectures

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Automatic data allocation and buffer management for multi-GPU machines

ACM Transactions on Architecture and Code Optimization (TACO)

Quantified Score

Hi-index	0.00

Visualization

Abstract

Several parallel architectures such as GPUs and the Cell processor have fast explicitly managed on-chip memories, in addition to slow off-chip memory. They also have very high computational power with multiple levels of parallelism. A significant challenge in programming these architectures is to effectively exploit the parallelism available in the architecture and manage the fast memories to maximize performance. In this paper we develop an approach to effective automatic data management for on-chip memories, including creation of buffers in on-chip (local) memories for holding portions of data accessed in a computational block, automatic determination of array access functions of local buffer references, and generation of code that moves data between slow off-chip memory and fast local memories. We also address the problem of mapping computation in regular programs to multi-level parallel architectures using a multi-level tiling approach, and study the impact of on-chip memory availability on the selection of tile sizes at various levels. Experimental results on a GPU demonstrate the effectiveness of the proposed approach.