Fence-free work stealing on bounded TSO processors

Authors:
Adam Morrison; Yehuda Afek
Affiliations:
Technion - Israel Institute of Technology, Haifa, Israel; Tel Aviv University, Tel Aviv, Israel
Venue:
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Year:
2014

Abstract

Work stealing is the method of choice for load balancing in task-parallel programming languages and frameworks. Yet despite considerable effort invested in optimizing work-stealing task queues, existing algorithms issue a costly memory fence when removing a task, and these fences are believed to be necessary for correctness. This paper refutes this belief, demonstrating work-stealing algorithms in which a worker does not issue a memory fence, for microarchitectures with a bounded total store ordering (TSO) memory model. Bounded TSO is a novel restriction of TSO, capturing mainstream x86 and SPARC TSO processors, that bounds the number of stores a load can be reordered with. Our algorithms eliminate the memory fence penalty, improving the running time of a suite of parallel benchmarks on modern x86 multicore processors by 7%-11% on average (and up to 23%), compared to the Cilk and Chase-Lev work stealing queues.
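
For orientation, the fence the abstract refers to can be located in a conventional work-stealing deque. The following is a minimal C++ sketch in the style of a Chase-Lev queue, not the fence-free algorithm this paper proposes. The class name `WorkStealingDeque`, the fixed `kCapacity`, the `int` task type, and the absence of buffer resizing are all illustrative assumptions. The point of interest is the store-load fence in `take()`: the worker publishes its decremented `bottom_` and must not read `top_` before that store is visible to thieves.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical fixed-capacity deque of int tasks (illustration only).
class WorkStealingDeque {
 public:
  static constexpr std::int64_t kCapacity = 1024;  // assumed; no resizing

  // Owner only: add a task at the bottom.
  void push(int task) {
    std::int64_t b = bottom_.load(std::memory_order_relaxed);
    std::int64_t t = top_.load(std::memory_order_acquire);
    assert(b - t < kCapacity && "sketch does not grow the buffer");
    buf_[b % kCapacity].store(task, std::memory_order_relaxed);
    bottom_.store(b + 1, std::memory_order_release);
  }

  // Owner only: remove a task from the bottom.
  std::optional<int> take() {
    std::int64_t b = bottom_.load(std::memory_order_relaxed) - 1;
    bottom_.store(b, std::memory_order_relaxed);
    // The costly store-load fence: the store to bottom_ must be visible
    // before top_ is read, or a thief and the owner could both take the
    // same task.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    std::int64_t t = top_.load(std::memory_order_relaxed);
    if (t > b) {  // deque was empty; undo the decrement
      bottom_.store(b + 1, std::memory_order_relaxed);
      return std::nullopt;
    }
    int task = buf_[b % kCapacity].load(std::memory_order_relaxed);
    if (t == b) {  // last task: compete with thieves through top_
      bool won = top_.compare_exchange_strong(
          t, t + 1, std::memory_order_seq_cst, std::memory_order_relaxed);
      bottom_.store(b + 1, std::memory_order_relaxed);
      if (!won) return std::nullopt;
    }
    return task;
  }

  // Any thread: try to steal the oldest task from the top.
  std::optional<int> steal() {
    std::int64_t t = top_.load(std::memory_order_acquire);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    std::int64_t b = bottom_.load(std::memory_order_acquire);
    if (t >= b) return std::nullopt;  // nothing to steal
    int task = buf_[t % kCapacity].load(std::memory_order_relaxed);
    if (!top_.compare_exchange_strong(t, t + 1, std::memory_order_seq_cst,
                                      std::memory_order_relaxed)) {
      return std::nullopt;  // lost the race to another thief or the owner
    }
    return task;
  }

 private:
  std::atomic<std::int64_t> top_{0};
  std::atomic<std::int64_t> bottom_{0};
  std::atomic<int> buf_[kCapacity];  // slots are written before being read
};
```

In this sketch the worker pays for the fence on every `take()`, even when no thief is active. That per-removal cost is what the abstract describes as the memory fence penalty.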