Advanced compiler optimizations for supercomputers
Communications of the ACM - Special issue on parallelism
Direct parallelization of call statements
SIGPLAN '86 Proceedings of the 1986 SIGPLAN symposium on Compiler construction
Semantical interprocedural parallelization: an overview of the PIPS project
ICS '91 Proceedings of the 5th international conference on Supercomputing
Interprocedural analyses for programming environments
Environments and tools for parallel scientific computing
Static and dynamic evaluation of data dependence analysis
ICS '93 Proceedings of the 7th international conference on Supercomputing
An interprocedural data flow analysis algorithm
POPL '77 Proceedings of the 4th ACM SIGACT-SIGPLAN symposium on Principles of programming languages
An Implementation of Interprocedural Bounded Regular Section Analysis
IEEE Transactions on Parallel and Distributed Systems
Performance Analysis of Parallelizing Compilers on the Perfect Benchmarks Programs
IEEE Transactions on Parallel and Distributed Systems
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Interprocedural Array Region Analyses
LCPC '95 Proceedings of the 8th International Workshop on Languages and Compilers for Parallel Computing
ParADE: An OpenMP Programming Environment for SMP Cluster Systems
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Towards OpenMP Execution on Software Distributed Shared Memory Systems
ISHPC '02 Proceedings of the 4th International Symposium on High Performance Computing
A comparative study of automatic vectorizing compilers
Parallel Computing
STEP: a distributed OpenMP for coarse-grain parallelism tool
IWOMP'08 Proceedings of the 4th international conference on OpenMP in a new era of parallelism
An Improved MAGMA GEMM for Fermi Graphics Processing Units
International Journal of High Performance Computing Applications
The rise and fall of High Performance Fortran
Communications of the ACM
An Evaluation of Vectorizing Compilers
PACT '11 Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques
Nowadays, high-performance applications exploit multi-level architectures, due to the presence of hardware accelerators such as GPUs inside each computing node. Data transfers occur at two different levels: inside a computing node, between the CPU and the accelerators, and between computing nodes. We consider the case where intra-node parallelism is handled with HMPP compiler directives and inter-node communications are programmed with MPI message passing. Programming such a heterogeneous architecture this way is costly and error-prone. In this paper, we demonstrate the transformation of HMPP programs designed to exploit a single computing node equipped with a GPU into heterogeneous HMPP + MPI programs exploiting multiple GPUs located on different computing nodes. The STEP tool generates the necessary communications, combining powerful static analyses with runtime execution to reduce the volume of communications. Our source-to-source transformation is implemented inside the PIPS workbench. We detail the generated source program for the Jacobi kernel and show that the execution times and speedups are encouraging. Finally, we give some directions for improving the tool.
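To make the two levels of parallelism concrete, the following C sketch illustrates the hybrid HMPP + MPI style of program that such a transformation targets, assuming a 1D block-row decomposition of the Jacobi grid: each MPI process exchanges halo rows with its neighbours and offloads its local sweep to the node's GPU through HMPP directives. The directive clauses (codelet, callsite, target=CUDA, args[...].io) follow the OpenHMPP style, and the decomposition, sizes, and the jacobi_step helper are illustrative assumptions, not the code actually generated by STEP inside PIPS.

/* Sketch of a hybrid HMPP + MPI Jacobi iteration: a 1D block-row
 * decomposition where each MPI process drives one GPU through HMPP
 * directives and exchanges halo rows with its neighbours.
 * Initialization and boundary conditions are omitted for brevity. */
#include <mpi.h>
#include <stdlib.h>

#define N     1024          /* global grid size (illustrative) */
#define ITERS 100

/* One Jacobi sweep on the local block (rows 1..rows, plus 2 halo rows).
 * The codelet directive asks HMPP to offload this function to the GPU. */
#pragma hmpp jacobi codelet, target=CUDA, args[uin].io=in, args[uout].io=out
void jacobi_step(int rows, int cols,
                 double uin[rows + 2][cols], double uout[rows + 2][cols])
{
    for (int i = 1; i <= rows; i++)
        for (int j = 1; j < cols - 1; j++)
            uout[i][j] = 0.25 * (uin[i - 1][j] + uin[i + 1][j]
                               + uin[i][j - 1] + uin[i][j + 1]);
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = N / size;                      /* local block of rows */
    double (*u)[N]  = calloc(rows + 2, sizeof *u);
    double (*un)[N] = calloc(rows + 2, sizeof *un);
    int up   = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int down = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    for (int it = 0; it < ITERS; it++) {
        /* Inter-node level: exchange halo rows with neighbouring processes. */
        MPI_Sendrecv(u[1],        N, MPI_DOUBLE, up,   0,
                     u[rows + 1], N, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(u[rows],     N, MPI_DOUBLE, down, 1,
                     u[0],        N, MPI_DOUBLE, up,   1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Intra-node level: run the sweep on the GPU attached to this node. */
        #pragma hmpp jacobi callsite
        jacobi_step(rows, N, u, un);

        double (*tmp)[N] = u; u = un; un = tmp;   /* swap buffers */
    }

    free(u);
    free(un);
    MPI_Finalize();
    return 0;
}

With an ordinary MPI C compiler the HMPP pragmas are simply ignored, so the sketch also runs as a plain CPU + MPI program; the point of the transformation described above is to produce the MPI level of such a program automatically from the single-node HMPP version.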