Adaptive work stealing with parallelism feedback

  • Authors:
  • Kunal Agrawal; Yuxiong He; Charles E. Leiserson

  • Affiliations:
  • MIT, Cambridge, MA; National University of Singapore, Singapore; MIT, Cambridge, MA

  • Venue:
  • Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
  • Year:
  • 2007

Abstract

We present an adaptive work-stealing thread scheduler, A-Steal, for fork-join multithreaded jobs, like those written using the Cilk multithreaded language or the Hood work-stealing library. The A-Steal algorithm is appropriate for large parallel servers where many jobs share a common multiprocessor resource and in which the number of processors available to a particular job may vary during the job's execution. A-Steal provides continual parallelism feedback to a job scheduler in the form of processor requests, and the job must adapt its execution to the processors allotted to it. Assuming that the job scheduler never allots any job more processors than requested by the job's thread scheduler, A-Steal guarantees that the job completes in near-optimal time while utilizing at least a constant fraction of the allotted processors. Our analysis models the job scheduler as the thread scheduler's adversary, challenging the thread scheduler to be robust to the system environment and the job scheduler's administrative policies. We analyze the performance of A-Steal using "trim analysis," which allows us to prove that our thread scheduler performs poorly on at most a small number of time steps, while exhibiting near-optimal behavior on the vast majority. To be precise, suppose that a job has work T1 and span (critical-path length) T∞. On a machine with P processors, A-Steal completes the job in expected O(T1/P̃ + T∞ + L lg P) time steps, where L is the length of a scheduling quantum and P̃ denotes the O(T∞ + L lg P)-trimmed availability. This quantity is the average of the processor availability over all but the O(T∞ + L lg P) time steps having the highest processor availability. When the job's parallelism dominates the trimmed availability, that is, P̃ ≪ T1/T∞, the job achieves nearly perfect linear speedup. Conversely, when the trimmed availability dominates the parallelism, the asymptotic running time of the job is nearly its span.
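
To make the stated bound concrete, the following is a minimal sketch (not from the paper) that evaluates the completion-time expression O(T1/P̃ + T∞ + L lg P) for hypothetical parameter values. The function name, the omitted constant factors, and all numeric inputs are illustrative assumptions only; the sketch simply shows how the bound separates into a work-dominated and a span-dominated regime.

    import math

    def asteal_time_bound(T1, Tinf, P, P_trimmed, L):
        """Evaluate the O(T1/P~ + Tinf + L lg P) bound (constants omitted).

        T1        -- total work of the job
        Tinf      -- span (critical-path length)
        P         -- number of processors on the machine
        P_trimmed -- O(Tinf + L lg P)-trimmed availability: average allotment
                     excluding the steps of highest processor availability
        L         -- length of a scheduling quantum
        All values below are hypothetical, chosen only to illustrate the regimes.
        """
        return T1 / P_trimmed + Tinf + L * math.log2(P)

    # Work-dominated regime: parallelism T1/Tinf far exceeds the trimmed
    # availability, so T1/P_trimmed dominates and the job sees nearly
    # linear speedup on its allotted processors.
    print(asteal_time_bound(T1=1e9, Tinf=1e3, P=64, P_trimmed=48, L=1e4))

    # Span-dominated regime: the trimmed availability exceeds the
    # parallelism, so the running time is governed mainly by Tinf.
    print(asteal_time_bound(T1=1e6, Tinf=1e5, P=64, P_trimmed=48, L=1e2))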