IEEE Transactions on Parallel and Distributed Systems
The implementation of the Cilk-5 multithreaded language
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Machine Learning
A taxonomy of scientific workflow systems for grid computing
ACM SIGMOD Record
Accelerator: using data parallelism to program GPUs for general-purpose uses
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Dryad: distributed data-parallel programs from sequential building blocks
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Merge: a programming model for heterogeneous multi-core systems
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Scalable Parallel Programming with CUDA
Queue - GPU Computing
Harmony: an execution model and runtime for heterogeneous many core systems
HPDC '08 Proceedings of the 17th international symposium on High performance distributed computing
Amdahl's Law in the Multicore Era
Computer
Predictive Runtime Code Scheduling for Heterogeneous Architectures
HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
CellSs: Scheduling techniques to better exploit memory hierarchy
Scientific Programming - High Performance Computing with the Cell Broadband Engine
A framework for efficient and scalable execution of domain-specific templates on GPUs
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Rodinia: A benchmark suite for heterogeneous computing
IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
Best-effort semantic document search on GPUs
Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
An asymmetric distributed shared memory model for heterogeneous parallel systems
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems
IEEE Design & Test
Data-aware scheduling of legacy kernels on heterogeneous platforms with distributed memory
Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
MapCG: writing parallel program portable between CPU and GPU
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Maestro: data orchestration and tuning for OpenCL devices
Euro-Par'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II
GROPHECY: GPU performance projection from CPU code skeletons
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
A systematic process for efficient execution on Intel's heterogeneous computation nodes
Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment: Bridging from the eXtreme to the campus and beyond
Automatic generation of software pipelines for heterogeneous parallel systems
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Dataflow-driven GPU performance projection for multi-kernel transformations
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
SemCache: semantics-aware caching for efficient GPU offloading
Proceedings of the 27th international ACM conference on International conference on supercomputing
Hi-index | 0.00 |
We present a runtime framework for the execution of work-loads represented as parallel-operator directed acyclic graphs (PO-DAGs) on heterogeneous multi-core platforms. PO-DAGs combine coarse-grained parallelism at the graph level with fine-grained parallelism within each node, lending naturally to exploiting the intra --- and inter-processing element parallelism present in heterogeneous platforms. We identify four important criteria - Suitability, Locality, Availability and Criticality (SLAC) --- and show that all these criteria must be considered by a heterogeneous runtime framework in order to achieve good performance under varying application and platform characteristics. The proposed model driven runtime (MDR) considers all the aforementioned factors, and tradeoffs among them, by utilizing performance models. These performance models are used to drive key run-time decisions such as mapping of tasks to PEs, scheduling of tasks on each PE, and copying data between memory spaces. We discuss the software architecture and implementation of MDR, and evaluate it using several benchmark programs on three different heterogeneous platforms that contain multi-core CPUs and GPUs. The hardware platforms represent server, laptop, and netbook class systems. MDR achieves up to 4.2X speedup (1.5X on average) over the best of CPU-only, GPU-only, round-robin, GPU-first, and utilization-driven schedulers. We also perform a sensitivity analysis that establishes the importance of considering all four SLAC criteria in order to achieve high performance execution in a heterogeneous runtime framework.