MDR: performance model driven runtime for heterogeneous parallel platforms

Authors:
Jacques A. Pienaar;Anand Raghunathan;Srimat Chakradhar
Affiliations:
Purdue University, West Lafayette, IN, USA;Purdue University, West Lafayette, IN, USA;NEC Laboratories America, Princeton, NJ, USA
Venue:
Proceedings of the international conference on Supercomputing
Year:
2011

Citing 22
Cited 5

Dynamic Critical-Path Scheduling: An Effective Technique for Allocating Task Graphs to Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
The implementation of the Cilk-5 multithreaded language

PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Machine Learning

Machine Learning
A taxonomy of scientific workflow systems for grid computing

ACM SIGMOD Record
Accelerator: using data parallelism to program GPUs for general-purpose uses

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Dryad: distributed data-parallel programs from sequential building blocks

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Merge: a programming model for heterogeneous multi-core systems

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Scalable Parallel Programming with CUDA

Queue - GPU Computing
Harmony: an execution model and runtime for heterogeneous many core systems

HPDC '08 Proceedings of the 17th international symposium on High performance distributed computing
Amdahl's Law in the Multicore Era

Computer
Predictive Runtime Code Scheduling for Heterogeneous Architectures

HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
CellSs: Scheduling techniques to better exploit memory hierarchy

Scientific Programming - High Performance Computing with the Cell Broadband Engine
A framework for efficient and scalable execution of domain-specific templates on GPUs

IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures

Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Rodinia: A benchmark suite for heterogeneous computing

IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
Best-effort semantic document search on GPUs

Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
An asymmetric distributed shared memory model for heterogeneous parallel systems

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems

IEEE Design & Test
Data-aware scheduling of legacy kernels on heterogeneous platforms with distributed memory

Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
MapCG: writing parallel program portable between CPU and GPU

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Maestro: data orchestration and tuning for OpenCL devices

Euro-Par'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II

GROPHECY: GPU performance projection from CPU code skeletons

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
A systematic process for efficient execution on Intel's heterogeneous computation nodes

Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment: Bridging from the eXtreme to the campus and beyond
Automatic generation of software pipelines for heterogeneous parallel systems

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Dataflow-driven GPU performance projection for multi-kernel transformations

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
SemCache: semantics-aware caching for efficient GPU offloading

Proceedings of the 27th international ACM conference on International conference on supercomputing

Quantified Score

Hi-index	0.00

Visualization

Abstract

We present a runtime framework for the execution of work-loads represented as parallel-operator directed acyclic graphs (PO-DAGs) on heterogeneous multi-core platforms. PO-DAGs combine coarse-grained parallelism at the graph level with fine-grained parallelism within each node, lending naturally to exploiting the intra --- and inter-processing element parallelism present in heterogeneous platforms. We identify four important criteria - Suitability, Locality, Availability and Criticality (SLAC) --- and show that all these criteria must be considered by a heterogeneous runtime framework in order to achieve good performance under varying application and platform characteristics. The proposed model driven runtime (MDR) considers all the aforementioned factors, and tradeoffs among them, by utilizing performance models. These performance models are used to drive key run-time decisions such as mapping of tasks to PEs, scheduling of tasks on each PE, and copying data between memory spaces. We discuss the software architecture and implementation of MDR, and evaluate it using several benchmark programs on three different heterogeneous platforms that contain multi-core CPUs and GPUs. The hardware platforms represent server, laptop, and netbook class systems. MDR achieves up to 4.2X speedup (1.5X on average) over the best of CPU-only, GPU-only, round-robin, GPU-first, and utilization-driven schedulers. We also perform a sensitivity analysis that establishes the importance of considering all four SLAC criteria in order to achieve high performance execution in a heterogeneous runtime framework.