Adaptive work-stealing with parallelism feedback

  • Authors:
  • Kunal Agrawal (Massachusetts Institute of Technology, Cambridge, MA); Charles E. Leiserson (Massachusetts Institute of Technology, Cambridge, MA); Yuxiong He (Nanyang Technological University); Wen-Jing Hsu (Nanyang Technological University)

  • Venue:
  • ACM Transactions on Computer Systems (TOCS)
  • Year:
  • 2008

Abstract

Multiprocessor scheduling in a shared multiprogramming environment can be structured as two-level scheduling, where a kernel-level job scheduler allots processors to jobs and a user-level thread scheduler schedules the work of a job on its allotted processors. We present a randomized work-stealing thread scheduler for fork-join multithreaded jobs that provides continual parallelism feedback to the job scheduler in the form of requests for processors. Our A-STEAL algorithm is appropriate for large parallel servers where many jobs share a common multiprocessor resource and in which the number of processors available to a particular job may vary during the job's execution. Assuming that the job scheduler never allots a job more processors than requested by the job's thread scheduler, A-STEAL guarantees that the job completes in near-optimal time while utilizing at least a constant fraction of the allotted processors.

We model the job scheduler as the thread scheduler's adversary, challenging the thread scheduler to be robust to the operating environment as well as to the job scheduler's administrative policies. For example, the job scheduler might make a large number of processors available exactly when the job has little use for them. To analyze the performance of our adaptive thread scheduler under this stringent adversarial assumption, we introduce a new technique called trim analysis, which allows us to prove that our thread scheduler performs poorly on no more than a small number of time steps, exhibiting near-optimal behavior on the vast majority. More precisely, suppose that a job has work T1 and span T∞. On a machine with P processors, A-STEAL completes the job in an expected duration of O(T1/P̃ + T∞ + L lg P) time steps, where L is the length of a scheduling quantum and P̃ denotes the O(T∞ + L lg P)-trimmed availability. This quantity is the average processor availability over all time steps except the O(T∞ + L lg P) time steps with the highest processor availability. When the job's parallelism dominates the trimmed availability, that is, P̃ ≪ T1/T∞, the job achieves nearly perfect linear speedup. Conversely, when the trimmed availability dominates the parallelism, the asymptotic running time of the job is nearly the length of its span, which is optimal.

We measured the performance of A-STEAL on a simulated multiprocessor system using synthetic workloads. For jobs with sufficient parallelism, our experiments confirm that A-STEAL provides almost perfect linear speedup across a variety of processor-availability profiles. We compared A-STEAL with the ABP algorithm, an adaptive work-stealing thread scheduler developed by Arora et al. [1998] that does not employ parallelism feedback. On moderately to heavily loaded machines with large numbers of processors, A-STEAL typically completed jobs more than twice as quickly as ABP, despite being allotted the same number of processors or fewer on every step, while wasting only 10% of the processor cycles wasted by ABP.
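The central quantity in the bound above is the trimmed availability P̃. The Python sketch below illustrates, under stated assumptions, how this quantity could be computed from a per-step availability trace and how it plugs into the stated running-time bound; the function names, the explicit constant c standing in for the O-notation, and the synthetic availability profile are illustrative assumptions, not anything specified in the paper.

```python
import math

def trimmed_availability(avail, r):
    """Mean processor availability over all time steps except the r steps
    with the highest availability (the r-trimmed availability of the abstract)."""
    if r >= len(avail):
        return 0.0
    kept = sorted(avail)[:len(avail) - r]   # discard the r largest values
    return sum(kept) / len(kept)

def asteal_time_estimate(t1, t_inf, L, P, avail, c=1.0):
    """Back-of-the-envelope instance of the bound O(T1/P~ + T_inf + L lg P).
    The constant c hidden by the O-notation is a placeholder assumption."""
    r = math.ceil(c * (t_inf + L * math.log2(P)))   # number of trimmed time steps
    p_trim = trimmed_availability(avail, r)
    if p_trim == 0:
        return float("inf")                         # no untrimmed steps remain
    return t1 / p_trim + t_inf + L * math.log2(P)

# Example (hypothetical numbers): a job with work 1e6 and span 1e3, a quantum
# length of 100, and a 64-processor machine whose per-step availability
# fluctuates between 8 and 64 processors.
avail = [8, 16, 64, 32, 64, 8, 16, 32] * 500
print(asteal_time_estimate(1e6, 1e3, 100, 64, avail))
```

Read this way, trimming discards the steps on which an adversarial job scheduler could offer many processors exactly when the job cannot exploit them, which is presumably why the bound is stated in terms of P̃ rather than the raw mean availability.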