While using a single GPU is fairly easy, using multiple CPUs and GPUs, potentially distributed over multiple machines, is hard: data must be kept consistent via message exchange, and the load must be balanced. We propose (1) an array package that provides partitioned and replicated arrays and (2) a compute-device library that abstracts from GPUs and CPUs and their location. Our system automatically distributes a parallel-for loop in data-parallel fashion over all the devices. This paper makes three contributions. First, we provide transparent use of multiple distributed GPUs and CPUs from within Java/OpenMP. Second, we partition arrays according to the compute devices' relative performance, which is computed from the execution time of a small micro-benchmark and a series of small bandwidth tests run at program start. Third, we repartition the arrays dynamically at run time by increasing or decreasing the number of machines used and by switching from CPUs-only to GPUs-only, to combinations of CPUs and GPUs, and back. With our dynamic device switching we minimize communication while maximizing device utilization. Our system automatically finds the optimal device sets and achieves a speedup of 5–200 on a cluster of 8 machines with 2 GPUs each.
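The performance-based partitioning described above can be sketched in a few lines of Java. The class name, method names, and the benchmark numbers below are hypothetical illustrations, not the paper's actual API: each device's share of the array is made proportional to its measured speed (the inverse of its micro-benchmark time), with any rounding remainder assigned to the last device.

```java
import java.util.Arrays;

// Hypothetical sketch of performance-proportional array partitioning.
// Not the paper's actual API; names and numbers are illustrative only.
public class DevicePartitioner {

    // benchmarkTimes[i] is the time device i needed for a fixed
    // micro-benchmark; faster devices (smaller times) get larger shares.
    static int[] partitionSizes(double[] benchmarkTimes, int totalElements) {
        double[] speeds = new double[benchmarkTimes.length];
        double totalSpeed = 0.0;
        for (int i = 0; i < benchmarkTimes.length; i++) {
            speeds[i] = 1.0 / benchmarkTimes[i];   // speed = inverse time
            totalSpeed += speeds[i];
        }
        int[] sizes = new int[benchmarkTimes.length];
        int assigned = 0;
        for (int i = 0; i < sizes.length; i++) {
            sizes[i] = (int) Math.floor(totalElements * speeds[i] / totalSpeed);
            assigned += sizes[i];
        }
        // Give the rounding remainder to the last device so the
        // partition sizes sum exactly to totalElements.
        sizes[sizes.length - 1] += totalElements - assigned;
        return sizes;
    }

    public static void main(String[] args) {
        // Example: one GPU that is 4x faster than each of two CPU cores,
        // partitioning an array of 1000 elements.
        double[] times = {1.0, 4.0, 4.0};
        System.out.println(Arrays.toString(partitionSizes(times, 1000)));
        // → [666, 166, 168]
    }
}
```

Repartitioning at run time (the paper's third contribution) would then amount to re-running this calculation over a different device set and redistributing array elements accordingly.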