Beyond nested parallelism: tight bounds on work-stealing overheads for parallel futures

  • Authors:
  • Daniel Spoonhower;Guy E. Blelloch;Phillip B. Gibbons;Robert Harper

  • Affiliations:
  • Carnegie Mellon University, Pittsburgh, PA, USA;Carnegie Mellon University, Pittsburgh, PA, USA;Intel Research Pittsburgh, Pittsburgh, PA, USA;Carnegie Mellon University, Pittsburgh, PA, USA

  • Venue:
  • Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures
  • Year:
  • 2009

Quantified Score

Hi-index 0.00

Visualization

Abstract

Work stealing is a popular method of scheduling fine-grained parallel tasks. The performance of work stealing has been extensively studied, both theoretically and empirically, but primarily for the restricted class of nested-parallel (or fully strict) computations. We extend this prior work by considering a broader class of programs that also supports pipelined parallelism through the use of parallel futures. Though the overhead of work-stealing schedulers is often quantified in terms of the number of steals, we show that a broader metric, the number of deviations, is a better way to quantify work-stealing overhead for less restrictive forms of parallelism, including parallel futures. For such parallelism, we prove bounds on work-stealing overheads--scheduler time and cache misses--as a function of the number of deviations. Deviations can occur, for example, when work is stolen or when a future is touched. We also show instances where deviations can occur independently of steals and touches. Next, we prove that, under work stealing, the expected number of deviations is O(Pd + td) in a P-processor execution of a computation with span d and t touches of futures. Moreover, this bound is existentially tight for any work-stealing scheduler that is parsimonious (those where processors steal only when their queues are empty); this class includes all prior work-stealing schedulers. We also present empirical measurements of the number of deviations incurred by a classic application of futures, Halstead's quicksort, using our parallel implementation of ML. Finally, we identify a family of applications that use futures and, in contrast to quicksort, incur significantly smaller overheads.