Cilk: an efficient multithreaded runtime system
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Scheduling multithreaded computations by work stealing
Journal of the ACM (JACM)
X10: an object-oriented approach to non-uniform cluster computing
OOPSLA '05 Proceedings of the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
CellSs: a programming model for the cell BE architecture
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
MapReduce: simplified data processing on large clusters
Communications of the ACM - 50th anniversary issue: 1958 - 2008
Scalable Parallel Programming with CUDA
Queue - GPU Computing
Phasers: a unified deadlock-free construct for collective and point-to-point synchronization
Proceedings of the 22nd annual international conference on Supercomputing
Mars: a MapReduce framework on graphics processors
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Map-reduce as a Programming Model for Custom Computing Machines
FCCM '08 Proceedings of the 2008 16th International Symposium on Field-Programmable Custom Computing Machines
Mapping Linear Workflows with Computation/Communication Overlap
ICPADS '08 Proceedings of the 2008 14th IEEE International Conference on Parallel and Distributed Systems
Work-first and help-first scheduling policies for async-finish task parallelism
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
An Extension of the StarSs Programming Model for Platforms with Multiple GPUs
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
SLAW: a scalable locality-aware adaptive work-stealing scheduler for multi-core systems
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Lime: a Java-compatible and synthesizable language for heterogeneous architectures
Proceedings of the ACM international conference on Object oriented programming systems languages and applications
Scientific Programming - Exploring Languages for Expressing Medium to Massive On-Chip Parallelism
Customizable Domain-Specific Computing
IEEE Design & Test
DrHJ: a lightweight pedagogic IDE for Habanero Java
Proceedings of the 9th International Conference on Principles and Practice of Programming in Java
A Heterogeneous Parallel Framework for Domain-Specific Languages
PACT '11 Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques
Hierarchical place trees: a portable abstraction for task parallelism and data movement
LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Portable performance on heterogeneous architectures
Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Portable mapping of openMP to multicore embedded systems using MCA APIs
Proceedings of the 14th ACM SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems
Semi-automatic restructuring of offloadable tasks for many-core accelerators
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Efficient implementation of data flow graphs on multi-gpu clusters
Journal of Real-Time Image Processing
Hi-index | 0.00 |
In this paper we explore mapping of a high-level macro data-flow programming model called Concurrent Collections (CnC) onto heterogeneous platforms in order to achieve high performance and low energy consumption while preserving the ease of use of data-flow programming. Modern computing platforms are becoming increasingly heterogeneous in order to improve energy efficiency. This trend is clearly seen across a diverse spectrum of platforms, from small-scale embedded SOCs to large-scale super-computers. However, programming these heterogeneous platforms poses a serious challenge for application developers. We have designed a software flow for converting high-level CnC programs to the Habanero-C language. CnC programs have a clear separation between the application description, the implementation of each of the application components and the abstraction of hardware platform, making it an excellent programming model for domain experts. Domain experts can later employ the help of a tuning expert (either a compiler or a person) to tune their applications with minimal effort. We also extend the Habanero-C runtime system to support work-stealing across heterogeneous computing devices and introduce task affinity for these heterogeneous components to allow users to fine tune the runtime scheduling decisions. We demonstrate a working example that maps a pipeline of medical image-processing algorithms onto a prototype heterogeneous platform that includes CPUs, GPUs and FPGAs. For the medical imaging domain, where obtaining fast and accurate results is a critical step in diagnosis and treatment of patients, we show that our model offers up to 17.72X speedup and an estimated usage of 0.52X of the power used by CPUs alone, when using accelerators (GPUs and FPGAs) and CPUs.