While using a single GPU is fairly easy, using multiple CPUs and GPUs, potentially distributed over multiple machines, is hard: data must be kept consistent via message exchange, and the load must be balanced. We propose (1) an array package that provides partitioned and replicated arrays and (2) a compute-device library that abstracts from GPUs and CPUs and their location. Our system automatically distributes a parallel-for loop in data-parallel fashion over all the devices. This paper makes three contributions. First, we provide transparent use of multiple distributed GPUs and CPUs from within Java/OpenMP. Second, we partition arrays according to the compute devices' relative performance, which is computed from the execution time of a small micro-benchmark and a series of small bandwidth tests run at program start. Third, we repartition the arrays dynamically at run time by increasing or decreasing the number of machines used and by switching from CPUs-only to GPUs-only, to combinations of CPUs and GPUs, and back. With our dynamic device switching we minimize communication while maximizing device utilization. Our system automatically finds the optimal device sets and achieves a speedup of 5–200 on a cluster of 8 machines with 2 GPUs each.
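The performance-based partitioning described above can be sketched in a few lines of Java. The class name, method names, and the benchmark numbers below are hypothetical illustrations, not the paper's actual API: each device's share of the array is made proportional to its measured speed (the inverse of its micro-benchmark time), with any rounding remainder assigned to the last device.

```java
import java.util.Arrays;

// Hypothetical sketch of performance-proportional array partitioning.
// Not the paper's actual API; names and numbers are illustrative only.
public class DevicePartitioner {

    // benchmarkTimes[i] is the time device i needed for a fixed
    // micro-benchmark; faster devices (smaller times) get larger shares.
    static int[] partitionSizes(double[] benchmarkTimes, int totalElements) {
        double[] speeds = new double[benchmarkTimes.length];
        double totalSpeed = 0.0;
        for (int i = 0; i < benchmarkTimes.length; i++) {
            speeds[i] = 1.0 / benchmarkTimes[i];   // speed = inverse time
            totalSpeed += speeds[i];
        }
        int[] sizes = new int[benchmarkTimes.length];
        int assigned = 0;
        for (int i = 0; i < sizes.length; i++) {
            sizes[i] = (int) Math.floor(totalElements * speeds[i] / totalSpeed);
            assigned += sizes[i];
        }
        // Give the rounding remainder to the last device so the
        // partition sizes sum exactly to totalElements.
        sizes[sizes.length - 1] += totalElements - assigned;
        return sizes;
    }

    public static void main(String[] args) {
        // Example: one GPU that is 4x faster than each of two CPU cores,
        // partitioning an array of 1000 elements.
        double[] times = {1.0, 4.0, 4.0};
        System.out.println(Arrays.toString(partitionSizes(times, 1000)));
        // → [666, 166, 168]
    }
}
```

Repartitioning at run time (the paper's third contribution) would then amount to re-running this calculation over a different device set and redistributing array elements accordingly.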