Friendly barriers: efficient work-stealing with return barriers

Authors:
Vivek Kumar; Stephen M. Blackburn; David Grove
Affiliations:
Australian National University, Canberra, Australia; Australian National University, Canberra, Australia; IBM T.J. Watson Research, New York, NY, USA
Venue:
Proceedings of the 10th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments
Year:
2014

Abstract

This paper addresses the problem of efficiently supporting parallelism within a managed runtime. A popular approach for exploiting software parallelism on parallel hardware is task parallelism, in which the programmer explicitly identifies potential parallelism and the runtime then schedules the work. Work-stealing is a promising scheduling strategy that a runtime may use to keep otherwise idle hardware busy while relieving overloaded hardware of its burden. However, work-stealing comes with substantial overheads. Recent work identified the sequential overheads of work-stealing, those that occur even when no stealing takes place, as a significant source of overhead, and reduced them to just 15%. In this work, we turn to dynamic overheads, those that occur each time a steal takes place. We show that the dynamic overhead is dominated by introspection of the victim's stack during a steal. We exploit the idea of a low-overhead return barrier to reduce the dynamic overhead by approximately half, resulting in total performance improvements of as much as 20%. Unlike prior work, we attack the overheads due directly to stealing, which are precisely the overheads that grow as parallelism grows; we therefore improve the scalability of work-stealing applications. This result is complementary to recent work addressing the sequential overheads of work-stealing. This work therefore substantially relieves work-stealing of the growing pressure imposed by increasing intra-node hardware parallelism.
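Why a return barrier can cut the cost of stack introspection can be illustrated with a deliberately simplified toy model. It assumes (our assumption, not a detail taken from the paper) that a thief walks the victim's stack from the oldest frame to find the oldest continuation not yet stolen, and that a return barrier marks a boundary below which every frame is already known to be stolen. When the victim returns past that boundary, the barrier "fires" and moves down. The names `ReturnBarrierDemo`, `VictimStack`, `Frame`, `steal` and `simulate` are hypothetical. The model counts only frames inspected. It does not model real frame layouts, return-address patching, or the synchronization a real runtime needs, so it is neither the authors' implementation nor a reproduction of their measurements.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (illustrative assumption only): a return barrier bounding a thief's stack walk. */
public class ReturnBarrierDemo {

    /** Hypothetical frame that records whether its continuation has been stolen. */
    static final class Frame {
        final String name;
        boolean stolen;
        Frame(String name) { this.name = name; }
    }

    /** Hypothetical victim stack; index 0 is the oldest frame. */
    static final class VictimStack {
        private final List<Frame> frames = new ArrayList<>();
        private final boolean useBarrier;
        // Invariant when useBarrier is true: every frame below this index is already stolen.
        private int barrier = 0;
        long framesScanned = 0;

        VictimStack(boolean useBarrier) { this.useBarrier = useBarrier; }

        /** Victim calls a method: a new, unstolen frame appears on top. */
        void push(String name) { frames.add(new Frame(name)); }

        /** Victim returns from its top frame. */
        void pop() {
            if (frames.isEmpty()) throw new IllegalStateException("empty stack");
            frames.remove(frames.size() - 1);
            // Returning past the barrier "fires" it: the boundary moves down to the new stack top.
            if (barrier > frames.size()) barrier = frames.size();
        }

        /** Thief: steal the oldest frame whose continuation has not been stolen yet. */
        Frame steal() {
            // Without a barrier the thief must rescan from the oldest frame every time.
            int start = useBarrier ? barrier : 0;
            for (int i = start; i < frames.size(); i++) {
                framesScanned++;
                Frame f = frames.get(i);
                if (!f.stolen) {
                    f.stolen = true;
                    // Frames 0..i are now all stolen, so later thieves can start at i + 1.
                    if (useBarrier) barrier = i + 1;
                    return f;
                }
            }
            return null; // nothing left to steal
        }
    }

    /** Builds a stack of the given depth, performs `steals` steals, returns frames inspected. */
    static long simulate(boolean useBarrier, int depth, int steals) {
        VictimStack victim = new VictimStack(useBarrier);
        for (int i = 0; i < depth; i++) victim.push("f" + i);
        for (int i = 0; i < steals; i++) victim.steal();
        return victim.framesScanned;
    }

    public static void main(String[] args) {
        System.out.println("without barrier: " + simulate(false, 100, 50)); // prints 1275
        System.out.println("with barrier:    " + simulate(true, 100, 50));  // prints 50
    }
}
```

In this model the k-th steal without a barrier re-inspects every frame already stolen, so k steals inspect k(k+1)/2 frames in total, whereas with the barrier each steal starts at the boundary and inspects one frame. The toy's savings are much larger than the reported halving of dynamic overhead because it ignores every other cost of a steal.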