Scheduling parallel programs by work stealing with private deques

Authors:
Umut A. Acar; Arthur Charguéraud; Mike Rainey
Affiliations:
Carnegie Mellon University, Pittsburgh, USA; Inria Saclay, Paris, France; Max Planck Institute for Software Systems, Kaiserslautern, Germany
Venue:
Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Year:
2013

Abstract

Work stealing has proven to be an effective method for scheduling parallel programs on multicore computers. To achieve high performance, work stealing distributes tasks among concurrent queues, called deques, one of which is assigned to each processor. Each processor operates on its deque locally except when performing load balancing via steals. Unfortunately, concurrent deques suffer from two limitations: 1) local deque operations require expensive memory fences on modern weak-memory architectures, and 2) they can be very difficult to extend to support the various optimizations and flexible task-distribution strategies needed by many applications, e.g., those that do not fit nicely into the divide-and-conquer, nested data parallel paradigm. For these reasons, there has been a lot of recent interest in implementations of work stealing with non-concurrent deques, where deques remain entirely private to each processor and load balancing is performed via message passing. Private deques eliminate the need for memory fences in local operations and enable the design and implementation of efficient techniques for reducing task-creation overheads and improving task distribution. These advantages, however, come at the cost of communication. It is not known whether work stealing with private deques enjoys the theoretical guarantees of work stealing with concurrent deques and whether it can be effective in practice. In this paper, we propose two work-stealing algorithms with private deques and prove that the algorithms guarantee theoretical bounds similar to those of work stealing with concurrent deques. For the analysis, we use a probabilistic model and consider a new parameter, the branching depth of the computation. We present an implementation of the algorithm as a C++ library and show that it compares well to Cilk on a range of benchmarks. Since our approach relies on private deques, it enables implementing flexible task creation and distribution strategies. As a specific example, we show how to implement task coalescing and steal-half strategies, which can be important in fine-grain, non-divide-and-conquer algorithms such as graph algorithms, and apply them to the depth-first-search problem.
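To make the idea of private deques with message-passing load balancing more concrete, the following is a minimal sketch, not the paper's algorithms or its library. It is a sequential, round-based simulation in which each worker touches only its own deque, an idle worker posts a steal request in a victim's mailbox, and the victim answers between tasks by sending either one task or half of its deque. All names (`Worker`, `simulate`, `steal_half`, the mailbox fields) and the fixed pseudo-random victim choice are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical sketch: a sequential simulation of work stealing with
// private deques. No real threads or fences are involved.
constexpr int kNoRequest = -1;

struct Worker {
    std::deque<int> tasks;     // private deque: only its owner reads or writes it
    std::deque<int> transfer;  // mailbox: tasks sent to this worker by a victim
    int request = kNoRequest;  // mailbox: id of a thief asking this worker for work
    bool waiting = false;      // true while this worker's own request is unanswered
    std::size_t executed = 0;  // number of tasks this worker has run
};

struct Stats {
    std::size_t executed;  // total tasks run by all workers
    std::size_t steals;    // number of requests answered with at least one task
};

Stats simulate(int num_workers, int num_tasks, bool steal_half) {
    assert(num_workers >= 1 && num_tasks >= 0);
    std::vector<Worker> ws(static_cast<std::size_t>(num_workers));
    for (int t = 0; t < num_tasks; ++t) ws[0].tasks.push_back(t);  // all work starts on worker 0

    std::size_t remaining = static_cast<std::size_t>(num_tasks);
    std::size_t steals = 0;
    unsigned rng = 12345u;  // fixed seed so the simulation is deterministic

    while (remaining > 0) {
        for (int i = 0; i < num_workers; ++i) {
            Worker& w = ws[static_cast<std::size_t>(i)];

            // 1. Poll the request mailbox and answer the thief, possibly with nothing.
            if (w.request != kNoRequest) {
                Worker& thief = ws[static_cast<std::size_t>(w.request)];
                std::size_t n = 0;
                if (w.tasks.size() >= 2) n = steal_half ? w.tasks.size() / 2 : 1;
                for (std::size_t k = 0; k < n; ++k) {
                    thief.transfer.push_back(w.tasks.front());  // hand over the oldest tasks
                    w.tasks.pop_front();
                }
                if (n > 0) ++steals;
                thief.waiting = false;  // the reply has been delivered
                w.request = kNoRequest;
            }

            // 2. Move any received tasks into the private deque.
            while (!w.transfer.empty()) {
                w.tasks.push_back(w.transfer.front());
                w.transfer.pop_front();
            }

            // 3. Run one local task, or ask a pseudo-randomly chosen victim for work.
            if (!w.tasks.empty()) {
                w.tasks.pop_back();  // "executing" a task just consumes it here
                ++w.executed;
                --remaining;
            } else if (!w.waiting && num_workers > 1) {
                rng = rng * 1664525u + 1013904223u;  // simple linear congruential step
                unsigned offset = 1u + (rng >> 16) % static_cast<unsigned>(num_workers - 1);
                int victim = static_cast<int>((static_cast<unsigned>(i) + offset) %
                                              static_cast<unsigned>(num_workers));
                Worker& v = ws[static_cast<std::size_t>(victim)];
                if (v.request == kNoRequest) {  // at most one pending request per victim
                    v.request = i;
                    w.waiting = true;
                }
            }
        }
    }

    std::size_t total = 0;
    for (const Worker& w : ws) total += w.executed;
    return Stats{total, steals};
}
```

For instance, `simulate(4, 1000, true)` runs 1000 tasks over four simulated workers with the steal-half reply, and `simulate(4, 1000, false)` sends one task per reply. In a real multithreaded setting the two mailboxes would need atomic operations, and deciding when to poll for requests is exactly the communication cost the abstract mentions; this sketch leaves both aside.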