Advanced compiler optimizations for supercomputers
Communications of the ACM - Special issue on parallelism
Direct parallelization of call statements
SIGPLAN '86 Proceedings of the 1986 SIGPLAN symposium on Compiler construction
Semantical interprocedural parallelization: an overview of the PIPS project
ICS '91 Proceedings of the 5th international conference on Supercomputing
Interprocedural analyses for programming environments
Environments and tools for parallel scientific computing
Static and dynamic evaluation of data dependence analysis
ICS '93 Proceedings of the 7th international conference on Supercomputing
An interprocedural data flow analysis algorithm
POPL '77 Proceedings of the 4th ACM SIGACT-SIGPLAN symposium on Principles of programming languages
An Implementation of Interprocedural Bounded Regular Section Analysis
IEEE Transactions on Parallel and Distributed Systems
Performance Analysis of Parallelizing Compilers on the Perfect Benchmarks Programs
IEEE Transactions on Parallel and Distributed Systems
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Interprocedural Array Region Analyses
LCPC '95 Proceedings of the 8th International Workshop on Languages and Compilers for Parallel Computing
ParADE: An OpenMP Programming Environment for SMP Cluster Systems
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Towards OpenMP Execution on Software Distributed Shared Memory Systems
ISHPC '02 Proceedings of the 4th International Symposium on High Performance Computing
A comparative study of automatic vectorizing compilers
Parallel Computing
STEP: a distributed OpenMP for coarse-grain parallelism tool
IWOMP'08 Proceedings of the 4th international conference on OpenMP in a new era of parallelism
An Improved MAGMA GEMM for Fermi Graphics Processing Units
International Journal of High Performance Computing Applications
The rise and fall of High Performance Fortran
Communications of the ACM
An Evaluation of Vectorizing Compilers
PACT '11 Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques
Nowadays, high-performance applications exploit multi-level architectures, due to the presence of hardware accelerators such as GPUs inside each computing node. Data transfers occur at two different levels: inside a computing node, between the CPU and the accelerators, and between computing nodes. We consider the case where intra-node parallelism is handled with HMPP compiler directives and inter-node communications are programmed with MPI message passing. Programming such a heterogeneous architecture this way is costly and error-prone. In this paper, we demonstrate the transformation of HMPP programs designed to exploit a single computing node equipped with a GPU into heterogeneous HMPP + MPI programs exploiting multiple GPUs located on different computing nodes. The STEP tool generates the necessary communications, combining powerful static analyses with runtime execution to reduce the volume of communications. Our source-to-source transformation is implemented inside the PIPS workbench. We detail the generated source program for the Jacobi kernel and show that the execution times and speedups are encouraging. Finally, we give some directions for improving the tool.
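To make the two levels of parallelism concrete, the following C sketch illustrates the hybrid HMPP + MPI style of program that such a transformation targets, assuming a 1D block-row decomposition of the Jacobi grid: each MPI process exchanges halo rows with its neighbours and offloads its local sweep to the node's GPU through HMPP directives. The directive clauses (codelet, callsite, target=CUDA, args[...].io) follow the OpenHMPP style, and the decomposition, sizes, and the jacobi_step helper are illustrative assumptions, not the code actually generated by STEP inside PIPS.

/* Sketch of a hybrid HMPP + MPI Jacobi iteration: a 1D block-row
 * decomposition where each MPI process drives one GPU through HMPP
 * directives and exchanges halo rows with its neighbours.
 * Initialization and boundary conditions are omitted for brevity. */
#include <mpi.h>
#include <stdlib.h>

#define N     1024          /* global grid size (illustrative) */
#define ITERS 100

/* One Jacobi sweep on the local block (rows 1..rows, plus 2 halo rows).
 * The codelet directive asks HMPP to offload this function to the GPU. */
#pragma hmpp jacobi codelet, target=CUDA, args[uin].io=in, args[uout].io=out
void jacobi_step(int rows, int cols,
                 double uin[rows + 2][cols], double uout[rows + 2][cols])
{
    for (int i = 1; i <= rows; i++)
        for (int j = 1; j < cols - 1; j++)
            uout[i][j] = 0.25 * (uin[i - 1][j] + uin[i + 1][j]
                               + uin[i][j - 1] + uin[i][j + 1]);
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = N / size;                      /* local block of rows */
    double (*u)[N]  = calloc(rows + 2, sizeof *u);
    double (*un)[N] = calloc(rows + 2, sizeof *un);
    int up   = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int down = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    for (int it = 0; it < ITERS; it++) {
        /* Inter-node level: exchange halo rows with neighbouring processes. */
        MPI_Sendrecv(u[1],        N, MPI_DOUBLE, up,   0,
                     u[rows + 1], N, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(u[rows],     N, MPI_DOUBLE, down, 1,
                     u[0],        N, MPI_DOUBLE, up,   1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Intra-node level: run the sweep on the GPU attached to this node. */
        #pragma hmpp jacobi callsite
        jacobi_step(rows, N, u, un);

        double (*tmp)[N] = u; u = un; un = tmp;   /* swap buffers */
    }

    free(u);
    free(un);
    MPI_Finalize();
    return 0;
}

With an ordinary MPI C compiler the HMPP pragmas are simply ignored, so the sketch also runs as a plain CPU + MPI program; the point of the transformation described above is to produce the MPI level of such a program automatically from the single-node HMPP version.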