Data-Aware Task Scheduling on Multi-accelerator Based Platforms

Authors:
Cedric Augonnet;Jerome Clet-Ortega;Samuel Thibault;Raymond Namyst
Venue:
ICPADS '10 Proceedings of the 2010 IEEE 16th International Conference on Parallel and Distributed Systems
Year:
2010

Cited by (11):

Improving programmability of heterogeneous many-core systems via explicit platform descriptions
Proceedings of the 4th International Workshop on Multicore Software Engineering

Petri-nets as an intermediate representation for heterogeneous architectures
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II

Using explicit platform descriptions to support programming of heterogeneous many-core systems
Parallel Computing

Scheduling processing of real-time data streams on heterogeneous multi-GPU systems
Proceedings of the 5th Annual International Systems and Storage Conference

Power-efficient time-sensitive mapping in heterogeneous systems
Proceedings of the 21st international conference on Parallel architectures and compilation techniques

StarPU-MPI: task programming over clusters of machines enhanced with accelerators
EuroMPI'12 Proceedings of the 19th European conference on Recent Advances in the Message Passing Interface

Load balancing in a changing world: dealing with heterogeneity and performance variability
Proceedings of the ACM International Conference on Computing Frontiers

Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
ACM SIGOPS 24th Symposium on Operating Systems Principles

Dandelion: a compiler and runtime for heterogeneous systems
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles

RSVM: a region-based software virtual memory for GPU
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques

Efficient implementation of data flow graphs on multi-gpu clusters
Journal of Real-Time Image Processing

Abstract

To fully tap into the potential of heterogeneous machines composed of multicore processors and multiple accelerators, simple offloading approaches in which the main trunk of the application runs on regular cores while only specific parts are offloaded to accelerators are not sufficient. The real challenge is to build systems where the application would permanently spread across the entire machine, that is, where parallel tasks would be dynamically scheduled over the full set of available processing units. To face this challenge, we previously proposed StarPU, a runtime system capable of scheduling tasks over multicore machines equipped with GPU accelerators. StarPU uses a software virtual shared memory (VSM) that provides a high-level programming interface and automates data transfers between processing units so as to enable dynamic scheduling of tasks. We now present how we have extended StarPU to minimize the cost of transfers between processing units in order to efficiently cope with multi-GPU hardware configurations. To this end, our runtime system implements data prefetching based on asynchronous data transfers, and uses data transfer cost prediction to influence the decisions taken by the task scheduler. We demonstrate the relevance of our approach by benchmarking two parallel numerical algorithms using our runtime system. We obtain significant speedups and high efficiency over multicore machines equipped with multiple accelerators. We also evaluate the behaviour of these applications over clusters featuring multiple GPUs per node, showing how our runtime system can combine with MPI.
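
The idea of letting predicted transfer costs influence scheduling decisions can be illustrated with a minimal sketch. It is not StarPU code and does not use the StarPU API: the names `Worker`, `Task`, `transfer_cost`, `pick_worker` and `schedule`, the latency-plus-bandwidth transfer model, and the default bandwidth and latency values are all assumptions made for illustration. Each task is placed on the processing unit with the earliest predicted completion time, where the prediction includes the time needed to move missing inputs to that unit's memory.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """A processing unit (CPU core or GPU); hypothetical, not a StarPU type."""
    name: str
    memory_node: str        # memory node this worker reads its data from
    ready_at: float = 0.0   # predicted time (s) at which the worker is idle


@dataclass
class Task:
    """A task with input data and per-worker execution time estimates."""
    name: str
    # Each input is (size_in_bytes, memory_node_holding_a_valid_copy).
    inputs: list = field(default_factory=list)
    # Assumed execution time estimate in seconds, keyed by worker name.
    exec_time: dict = field(default_factory=dict)


def transfer_cost(task, worker, bandwidth=5e9, latency=1e-5):
    """Predict the time to bring every missing input to the worker's memory.

    Assumed model: latency + size / bandwidth per input that is not
    already on the worker's memory node.
    """
    cost = 0.0
    for size_bytes, location in task.inputs:
        if location != worker.memory_node:
            cost += latency + size_bytes / bandwidth
    return cost


def pick_worker(task, workers, bandwidth=5e9, latency=1e-5):
    """Return the worker with the earliest predicted completion time."""
    def completion(worker):
        return (worker.ready_at
                + transfer_cost(task, worker, bandwidth, latency)
                + task.exec_time[worker.name])
    return min(workers, key=completion)


def schedule(task, workers, bandwidth=5e9, latency=1e-5):
    """Assign the task to the best worker and update its availability.

    The transfer time is added in full, which is pessimistic: with
    asynchronous prefetching, part of it could overlap with computation.
    """
    best = pick_worker(task, workers, bandwidth, latency)
    best.ready_at += (transfer_cost(task, best, bandwidth, latency)
                      + task.exec_time[best.name])
    # After execution, valid copies of the inputs live on the chosen node.
    task.inputs = [(size, best.memory_node) for size, _ in task.inputs]
    return best
```

With two equally fast GPUs and a large input already resident on the second one, this sketch prefers the second GPU; a scheduler that ignored transfer cost would see a tie. If the second GPU is busy for long enough, the predicted transfer becomes the cheaper option and the task moves to the first GPU instead.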