Beyond nested parallelism: tight bounds on work-stealing overheads for parallel futures

Authors:
Daniel Spoonhower;Guy E. Blelloch;Phillip B. Gibbons;Robert Harper
Affiliations:
Carnegie Mellon University, Pittsburgh, PA, USA;Carnegie Mellon University, Pittsburgh, PA, USA;Intel Research Pittsburgh, Pittsburgh, PA, USA;Carnegie Mellon University, Pittsburgh, PA, USA
Venue:
Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures
Year:
2009

Abstract

Work stealing is a popular method of scheduling fine-grained parallel tasks. The performance of work stealing has been extensively studied, both theoretically and empirically, but primarily for the restricted class of nested-parallel (or fully strict) computations. We extend this prior work by considering a broader class of programs that also supports pipelined parallelism through the use of parallel futures. Though the overhead of work-stealing schedulers is often quantified in terms of the number of steals, we show that a broader metric, the number of deviations, is a better way to quantify work-stealing overhead for less restrictive forms of parallelism, including parallel futures. For such parallelism, we prove bounds on work-stealing overheads (scheduler time and cache misses) as a function of the number of deviations. Deviations can occur, for example, when work is stolen or when a future is touched. We also show instances where deviations can occur independently of steals and touches. Next, we prove that, under work stealing, the expected number of deviations is O(Pd + td) in a P-processor execution of a computation with span d and t touches of futures. Moreover, this bound is existentially tight for any work-stealing scheduler that is parsimonious, that is, one in which processors steal only when their queues are empty; this class includes all prior work-stealing schedulers. We also present empirical measurements of the number of deviations incurred by a classic application of futures, Halstead's quicksort, using our parallel implementation of ML. Finally, we identify a family of applications that use futures and, in contrast to quicksort, incur significantly smaller overheads.
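To make the terms "future", "touch", and the quantity inside the O(Pd + td) bound concrete, here is a minimal Python sketch. It is not the authors' ML implementation and it is not a work-stealing scheduler: the `Future` class, its `touch` method, the `touches` counter, and `deviation_bound` are hypothetical names introduced only for illustration, and each future simply runs in its own thread. The quicksort shown is also a simplification. It touches each future in the same call that created it, which is still nested-parallel; Halstead's formulation instead passes untouched futures along inside the output list so that consumers can start before producers finish, which is what makes it pipelined.

```python
import threading


class Future:
    """Hypothetical minimal future: evaluates fn(*args) in its own thread."""

    touches = 0                 # total touches so far (the "t" in the bound)
    _lock = threading.Lock()    # protects the shared counter

    def __init__(self, fn, *args):
        self._value = None
        self._thread = threading.Thread(target=self._run, args=(fn, args))
        self._thread.start()

    def _run(self, fn, args):
        self._value = fn(*args)

    def touch(self):
        """Block until the value is ready, then return it."""
        with Future._lock:
            Future.touches += 1
        self._thread.join()
        return self._value


def quicksort(xs):
    """Sort xs, evaluating the 'greater' partition in a future (small inputs only)."""
    if len(xs) <= 1:
        return list(xs)
    pivot, rest = xs[0], xs[1:]
    less = [x for x in rest if x < pivot]
    greater = [x for x in rest if x >= pivot]
    greater_sorted = Future(quicksort, greater)   # create the future
    less_sorted = quicksort(less)                 # keep working locally
    return less_sorted + [pivot] + greater_sorted.touch()


def deviation_bound(p, d, t):
    """The quantity P*d + t*d inside the O(.) bound, with constants ignored."""
    return p * d + t * d
```

For example, `quicksort([3, 1, 4, 1, 5, 9, 2, 6])` returns the sorted list and increments `Future.touches` once per recursive call on a list of two or more elements. Plugging a processor count, a span, and that touch count into `deviation_bound` shows how the bound grows linearly in both P and t for a fixed span d.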