StarPU: a unified platform for task scheduling on heterogeneous multicore architectures

Authors:
Cédric Augonnet;Samuel Thibault;Raymond Namyst;Pierre-André Wacrenier
Affiliations:
University of Bordeaux, LaBRI–INRIA Bordeaux Sud-Ouest, Talence, France;University of Bordeaux, LaBRI–INRIA Bordeaux Sud-Ouest, Talence, France;University of Bordeaux, LaBRI–INRIA Bordeaux Sud-Ouest, Talence, France;University of Bordeaux, LaBRI–INRIA Bordeaux Sud-Ouest, Talence, France
Venue:
Concurrency and Computation: Practice & Experience - Euro-Par 2009
Year:
2011

Abstract

In the field of HPC, the current hardware trend is to design multiprocessor architectures featuring heterogeneous technologies such as specialized coprocessors (e.g. Cell/BE) or data-parallel accelerators (e.g. GPUs). Approaching the theoretical performance of these architectures is a complex issue. Indeed, substantial efforts have already been devoted to efficiently offloading parts of the computations. However, designing an execution model that unifies all computing units and their associated embedded memory remains a major challenge. We therefore designed StarPU, an original runtime system providing a high-level, unified execution model tightly coupled with an expressive data management library. The main goal of StarPU is to provide numerical kernel designers with a convenient way to generate parallel tasks over heterogeneous hardware on the one hand, and to easily develop and tune powerful scheduling algorithms on the other hand. We have developed several scheduling strategies that can be selected seamlessly at run-time, and we have analyzed their efficiency on several algorithms running simultaneously over multiple cores and a GPU. In addition to substantial improvements in execution times, we have obtained consistent superlinear parallelism by actually exploiting the heterogeneous nature of the machine. Finally, we show that our dynamic approach competes with the highly optimized MAGMA library and overcomes the limitations of the corresponding static scheduling in a portable way. Copyright © 2010 John Wiley & Sons, Ltd.
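
The idea of scheduling strategies that are selected at run-time over a mix of cores and a GPU can be illustrated with a small simulation. The sketch below is not StarPU's API: the `Codelet` class, the two policies (`greedy`, `earliest_finish`), the `simulate` function, and all cost figures are hypothetical names and numbers chosen for illustration only. It assumes each task kind carries one estimated execution time per type of processing unit, and that the policy is looked up by name when the run starts.

```python
from dataclasses import dataclass


@dataclass
class Codelet:
    """A task kind with one estimated cost per unit type (made-up numbers)."""
    name: str
    cost: dict  # unit type -> estimated execution time, e.g. {"cpu": 8.0, "gpu": 1.0}


def greedy(ready_at, workers, task):
    """Pick the worker that becomes idle first, ignoring how well it suits the task."""
    return min(range(len(workers)), key=lambda w: ready_at[w])


def earliest_finish(ready_at, workers, task):
    """Pick the worker on which the task is expected to finish first."""
    return min(range(len(workers)), key=lambda w: ready_at[w] + task.cost[workers[w]])


# Policies are registered by name so that one can be chosen when the run starts.
POLICIES = {"greedy": greedy, "earliest_finish": earliest_finish}


def simulate(tasks, workers, policy_name):
    """Assign tasks in submission order and return the makespan."""
    policy = POLICIES[policy_name]
    ready_at = [0.0] * len(workers)  # time at which each worker becomes free
    for task in tasks:
        w = policy(ready_at, workers, task)
        ready_at[w] += task.cost[workers[w]]
    return max(ready_at)


# Hypothetical machine: two CPU cores and one GPU.
workers = ["cpu", "cpu", "gpu"]

# Hypothetical workload: one kernel that suits the GPU, one that suits a CPU core.
gpu_friendly = Codelet("gpu_friendly", {"cpu": 8.0, "gpu": 1.0})
cpu_friendly = Codelet("cpu_friendly", {"cpu": 2.0, "gpu": 4.0})
tasks = [gpu_friendly] * 4 + [cpu_friendly] * 4

for name in POLICIES:
    print(name, simulate(tasks, workers, name))
```

With these made-up costs, the cost-aware policy reaches a makespan of 4.0 against 10.0 for the greedy one, because it keeps each kernel on the unit type that runs it fastest. The figures say nothing about the measurements reported in the paper; they only show why a scheduler that knows per-unit costs can benefit from heterogeneity.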