Design and implementation of a customizable work stealing scheduler

Authors:
Jun Nakashima;Sho Nakatani;Kenjiro Taura
Affiliations:
The University of Tokyo;The University of Tokyo;The University of Tokyo
Venue:
Proceedings of the 3rd International Workshop on Runtime and Operating Systems for Supercomputers
Year:
2013


Abstract

An efficient scheduler is important for task parallelism. It should provide a scalable, dynamic load-balancing mechanism among CPU cores. To meet this requirement, most runtime systems for task parallelism use work stealing as their scheduling strategy. Work stealing schedulers typically choose the worker to steal from at random. This strategy considers neither hardware-specific knowledge, such as the memory hierarchy, nor application-specific knowledge, such as cache usage. To execute tasks more efficiently, work stealing schedulers should take such knowledge into account. To this end, we propose an API that customizes scheduling strategies and takes hardware- and application-specific knowledge into account while preserving the desirable properties of work stealing. This paper describes the design of the proposed API. Specifically, it provides mechanisms to attach scheduling hints to tasks and to implement user-defined work stealing functions. These mechanisms enable programmers to implement a work stealing strategy optimized for their applications. This paper also presents preliminary evaluation results for the proposed API. A kernel of the STREAM microbenchmark improved by 58.8% with a work stealing strategy that utilizes data cached by the previous iteration. Matrix multiplication performance improved by 18.2% on 32 AMD cores with a work stealing strategy that tries to steal the most coarse-grained task available.
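To make the two mechanisms concrete, the following is a minimal single-threaded sketch of how scheduling hints and a user-defined steal function could fit together. It is not the API proposed in the paper: `Task`, `Worker`, `random_steal`, `coarsest_steal`, and the integer `hint` field are all hypothetical names chosen for illustration, and a real runtime would need concurrent deques and synchronization that are omitted here.

```python
import random
from collections import deque
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    name: str
    hint: int  # hypothetical scheduling hint, e.g. an estimate of task size


class Worker:
    def __init__(self, wid: int) -> None:
        self.wid = wid
        self.tasks: deque = deque()  # owner works at the right end

    def push(self, task: Task) -> None:
        self.tasks.append(task)

    def pop(self) -> Optional[Task]:
        # The owner takes its most recently pushed task (LIFO).
        return self.tasks.pop() if self.tasks else None


def random_steal(thief: Worker, workers: List[Worker],
                 rng: random.Random) -> Optional[Task]:
    """Baseline strategy: pick a random non-empty victim, take its oldest task."""
    victims = [w for w in workers if w is not thief and w.tasks]
    if not victims:
        return None
    return rng.choice(victims).tasks.popleft()


def coarsest_steal(thief: Worker, workers: List[Worker],
                   rng: random.Random) -> Optional[Task]:
    """Custom strategy: steal from the victim whose oldest task has the largest hint.

    The rng argument is unused; it is kept so both strategies share a signature.
    """
    best = None
    for w in workers:
        if w is thief or not w.tasks:
            continue
        if best is None or w.tasks[0].hint > best.tasks[0].hint:
            best = w
    return best.tasks.popleft() if best is not None else None


if __name__ == "__main__":
    workers = [Worker(0), Worker(1), Worker(2)]
    workers[1].push(Task("small-parent", 4))
    workers[1].push(Task("small-child", 2))
    workers[2].push(Task("large-parent", 16))
    workers[2].push(Task("large-child", 8))

    stolen = coarsest_steal(workers[0], workers, random.Random(0))
    print(stolen.name)  # prints "large-parent"
```

Because both strategies share one signature, a scheduler loop could accept either as a parameter. That is the general idea behind a pluggable steal function, although the interface in the paper may differ.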