Scheduling multithreaded computations by work stealing

Authors:
Robert D. Blumofe;Charles E. Leiserson
Affiliations:
Univ. of Texas at Austin, Austin;MIT Lab for Computer Science, Cambridge, MA
Venue:
Journal of the ACM (JACM)
Year:
1999


Abstract

This paper studies the problem of efficiently scheduling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is “work stealing,” in which processors needing work steal computational threads from other processors. In this paper, we give the first provably good work-stealing scheduler for multithreaded computations with dependencies. Specifically, our analysis shows that the expected time to execute a fully strict computation on $P$ processors using our work-stealing scheduler is $T_1/P + O(T_\infty)$, where $T_1$ is the minimum serial execution time of the multithreaded computation and $T_\infty$ is the minimum execution time with an infinite number of processors. Moreover, the space required by the execution is at most $S_1 P$, where $S_1$ is the minimum serial space requirement. We also show that the expected total communication of the algorithm is at most $O(P T_\infty (1 + n_d) S_{\max})$, where $S_{\max}$ is the size of the largest activation record of any thread and $n_d$ is the maximum number of times that any thread synchronizes with its parent. This communication bound justifies the folk wisdom that work-stealing schedulers are more communication efficient than their work-sharing counterparts. All three of these bounds are existentially optimal to within a constant factor.
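
To make the idea of work stealing concrete, the following is a minimal toy simulation. It is not the scheduler analyzed in the paper. It assumes a full binary spawn tree of unit-time tasks with no synchronization between parent and child, and it treats a steal as taking zero time. The function name `simulate_work_stealing` and its parameters are hypothetical. Each worker takes tasks from the bottom of its own deque, and an idle worker tries to steal from the top of a randomly chosen victim's deque. In this model $T_1$ is the number of tree nodes and $T_\infty$ is the number of tree levels.

```python
import random
from collections import deque


def simulate_work_stealing(depth, num_workers, seed=0):
    """Simulate a toy work-stealing run on a full binary spawn tree.

    Assumptions (illustrative only): every task takes one time step,
    a task above the leaf level spawns two children, there are no joins,
    and a successful steal costs no time.

    Returns (steps, executed): time steps taken and tasks executed.
    """
    rng = random.Random(seed)
    deques = [deque() for _ in range(num_workers)]
    deques[0].append(0)  # the root task, represented by its depth
    total_work = 2 ** (depth + 1) - 1  # T_1 in this model
    executed = 0
    steps = 0
    while executed < total_work:
        steps += 1
        # Phase 1: each worker picks a task for this step.
        current = []
        for w in range(num_workers):
            if deques[w]:
                # Work on the most recently pushed task (bottom of own deque).
                current.append(deques[w].pop())
            else:
                # Idle: try to steal the oldest task (top) of a random victim.
                victim = rng.randrange(num_workers)
                if victim != w and deques[victim]:
                    current.append(deques[victim].popleft())
                else:
                    current.append(None)  # failed steal attempt
        # Phase 2: run the picked tasks; spawned children become
        # available to run or be stolen from the next step onward.
        for w, task in enumerate(current):
            if task is None:
                continue
            executed += 1
            if task < depth:
                deques[w].append(task + 1)
                deques[w].append(task + 1)
    return steps, executed


if __name__ == "__main__":
    for p in (1, 2, 4, 8):
        steps, executed = simulate_work_stealing(depth=10, num_workers=p)
        print(f"P={p}: steps={steps}, tasks={executed}")
```

With one worker the run takes exactly $T_1$ steps. With more workers the step count can never drop below either $T_1/P$ or $T_\infty$ in this model, which are the two terms that appear in the expected-time bound of the abstract.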