Adaptive work stealing with parallelism feedback

  • Authors:
  • Kunal Agrawal; Yuxiong He; Charles E. Leiserson

  • Affiliations:
  • MIT, Cambridge, MA; National University of Singapore, Singapore; MIT, Cambridge, MA

  • Venue:
  • Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
  • Year:
  • 2007

Abstract

We present an adaptive work-stealing thread scheduler, A-Steal, for fork-join multithreaded jobs, like those written using the Cilk multithreaded language or the Hood work-stealing library. The A-Steal algorithm is appropriate for large parallel servers where many jobs share a common multiprocessor resource and in which the number of processors available to a particular job may vary during the job's execution. A-Steal provides continual parallelism feedback to a job scheduler in the form of processor requests, and the job must adapt its execution to the processors allotted to it. Assuming that the job scheduler never allots any job more processors than requested by the job's thread scheduler, A-Steal guarantees that the job completes in near-optimal time while utilizing at least a constant fraction of the allotted processors. Our analysis models the job scheduler as the thread scheduler's adversary, challenging the thread scheduler to be robust to the system environment and the job scheduler's administrative policies. We analyze the performance of A-Steal using "trim analysis," which allows us to prove that our thread scheduler performs poorly on at most a small number of time steps, while exhibiting near-optimal behavior on the vast majority. To be precise, suppose that a job has work T1 and span (critical-path length) T∞. On a machine with P processors, A-Steal completes the job in expected O(T1/P̃ + T∞ + L lg P) time steps, where L is the length of a scheduling quantum and P̃ denotes the O(T∞ + L lg P)-trimmed availability. This quantity is the average of the processor availability over all but the O(T∞ + L lg P) time steps having the highest processor availability. When the job's parallelism dominates the trimmed availability, that is, P̃ ≪ T1/T∞, the job achieves nearly perfect linear speedup. Conversely, when the trimmed availability dominates the parallelism, the asymptotic running time of the job is nearly its span.
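
To make the stated bound concrete, the following is a minimal sketch (not from the paper) that evaluates the completion-time expression O(T1/P̃ + T∞ + L lg P) for hypothetical parameter values. The function name, the omitted constant factors, and all numeric inputs are illustrative assumptions only; the sketch simply shows how the bound separates into a work-dominated and a span-dominated regime.

    import math

    def asteal_time_bound(T1, Tinf, P, P_trimmed, L):
        """Evaluate the O(T1/P~ + Tinf + L lg P) bound (constants omitted).

        T1        -- total work of the job
        Tinf      -- span (critical-path length)
        P         -- number of processors on the machine
        P_trimmed -- O(Tinf + L lg P)-trimmed availability: average allotment
                     excluding the steps of highest processor availability
        L         -- length of a scheduling quantum
        All values below are hypothetical, chosen only to illustrate the regimes.
        """
        return T1 / P_trimmed + Tinf + L * math.log2(P)

    # Work-dominated regime: parallelism T1/Tinf far exceeds the trimmed
    # availability, so T1/P_trimmed dominates and the job sees nearly
    # linear speedup on its allotted processors.
    print(asteal_time_bound(T1=1e9, Tinf=1e3, P=64, P_trimmed=48, L=1e4))

    # Span-dominated regime: the trimmed availability exceeds the
    # parallelism, so the running time is governed mainly by Tinf.
    print(asteal_time_bound(T1=1e6, Tinf=1e5, P=64, P_trimmed=48, L=1e2))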