Programming for parallel architectures that do not have a shared address space is extremely difficult due to the need for explicit communication between the memories of different compute devices. A heterogeneous system with CPUs and multiple GPUs, or a distributed-memory cluster, is an example of such an architecture. Past works that try to automate data movement for distributed-memory architectures can lead to excessive redundant communication. In this paper, we propose an automatic data movement scheme that minimizes the volume of communication between compute devices in heterogeneous and distributed-memory systems. We show that by partitioning data dependences in a particular non-trivial way, one can generate data movement code that results in the minimum communication volume for a vast majority of cases. The techniques are applicable to any sequence of affine loop nests and work on top of any choice of loop transformations, parallelization, and computation placement; the generated data movement code minimizes the communication volume for that particular configuration. We combine powerful static analyses relying on the polyhedral compiler framework with the lightweight runtime routines they generate to build a source-to-source transformation tool that automatically generates communication code. We demonstrate that the tool is scalable and leads to substantial gains in efficiency. On a heterogeneous system, the communication volume is reduced by a factor of 11x to 83x over the state of the art, translating into a mean execution-time speedup of 1.53x. On a distributed-memory cluster, our scheme reduces the communication volume by a factor of 1.4x to 63.5x over the state of the art, resulting in a mean speedup of 1.55x. In addition, our scheme yields a mean speedup of 2.19x over hand-optimized UPC codes.
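The core idea behind minimum-volume data movement is to communicate only the values that cross-device data dependences actually require, rather than whole arrays or entire owned blocks. The sketch below is a hand-written MPI analogue for a 1-D Jacobi-style stencil distributed block-wise across ranks: each rank exchanges exactly one boundary element per neighbour per time step. This is only an illustrative sketch under assumed parameters (the problem size N, step count STEPS, block partitioning, and MPI setup are invented for the example); it is not output of the paper's tool, which derives such transfers automatically for arbitrary affine loop nests from the partitioned dependences.

```c
/* Illustrative sketch (not the tool's generated code): a 1-D Jacobi sweep
 * distributed block-wise over MPI ranks. Only the single boundary value
 * that the neighbouring rank's next iteration reads (the flow-out of the
 * cross-partition dependence) is communicated, not the whole owned block. */
#include <mpi.h>
#include <stdlib.h>

#define N     1024   /* global problem size (assumed for the example) */
#define STEPS 100    /* number of time steps (assumed for the example) */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = N / size;                    /* block partition per rank */
    /* local block plus one halo cell on each side */
    double *a = calloc(chunk + 2, sizeof *a);
    double *b = calloc(chunk + 2, sizeof *b);

    for (int t = 0; t < STEPS; t++) {
        /* Exchange exactly the values the neighbours' dependences need:
         * one double per boundary, instead of the whole chunk. */
        if (rank > 0)
            MPI_Sendrecv(&a[1], 1, MPI_DOUBLE, rank - 1, 0,
                         &a[0], 1, MPI_DOUBLE, rank - 1, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (rank < size - 1)
            MPI_Sendrecv(&a[chunk], 1, MPI_DOUBLE, rank + 1, 0,
                         &a[chunk + 1], 1, MPI_DOUBLE, rank + 1, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* local 3-point stencil over the owned block */
        for (int i = 1; i <= chunk; i++)
            b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0;

        double *tmp = a; a = b; b = tmp;     /* swap time steps */
    }

    free(a);
    free(b);
    MPI_Finalize();
    return 0;
}
```

In this hand-written case the minimum transfer is obvious; the contribution of the paper's scheme is to compute such minimal communication sets automatically, via polyhedral dependence partitioning, for arbitrary affine loop nests and any given parallelization and placement.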