Work stealing is the method of choice for load balancing in task parallel programming languages and frameworks. Yet despite considerable effort invested in optimizing work stealing task queues, existing algorithms issue a costly memory fence when removing a task, and these fences are believed to be necessary for correctness. This paper refutes this belief, demonstrating work stealing algorithms in which a worker does not issue a memory fence for microarchitectures with a bounded total store ordering (TSO) memory model. Bounded TSO is a novel restriction of TSO, capturing mainstream x86 and SPARC TSO processors, that bounds the number of stores a load can be reordered with. Our algorithms eliminate the memory fence penalty, improving the running time of a suite of parallel benchmarks on modern x86 multicore processors by 7%-11% on average (and up to 23%), compared to the Cilk and Chase-Lev work stealing queues.
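To make the fence in question concrete, the following is a minimal sketch of a conventional, fenced work-stealing deque in the general style of the Chase-Lev queue, written with C++11 atomics. It is not the fence-free algorithm this paper proposes. The class name `WorkStealingDeque`, the fixed `Capacity` parameter, and the `push`/`take`/`steal` method names are assumptions made for illustration, and the fixed-size buffer is a simplification of the growable array used in practice.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Illustrative sketch only: a fixed-capacity deque in the style of Chase-Lev.
// All names here are hypothetical. T must be trivially copyable.
template <typename T, std::size_t Capacity = 1024>
class WorkStealingDeque {
    std::atomic<std::int64_t> top_{0};     // thieves remove from this end
    std::atomic<std::int64_t> bottom_{0};  // the owner pushes and takes here
    std::atomic<T> slots_[Capacity];

    static std::size_t index(std::int64_t i) {
        return static_cast<std::size_t>(i) % Capacity;
    }

public:
    // Owner only. Returns false when the buffer is full.
    bool push(T task) {
        std::int64_t b = bottom_.load(std::memory_order_relaxed);
        std::int64_t t = top_.load(std::memory_order_acquire);
        if (b - t >= static_cast<std::int64_t>(Capacity)) return false;
        slots_[index(b)].store(task, std::memory_order_relaxed);
        bottom_.store(b + 1, std::memory_order_release);
        return true;
    }

    // Owner only. Returns false when the deque is empty or a thief won.
    bool take(T& out) {
        std::int64_t b = bottom_.load(std::memory_order_relaxed) - 1;
        bottom_.store(b, std::memory_order_relaxed);
        // The fence discussed above: the store to bottom_ must become
        // visible before top_ is read, or owner and thief could both
        // claim the last task. On TSO hardware this is a store-load fence.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        std::int64_t t = top_.load(std::memory_order_relaxed);
        if (t > b) {  // empty: undo the decrement
            bottom_.store(b + 1, std::memory_order_relaxed);
            return false;
        }
        out = slots_[index(b)].load(std::memory_order_relaxed);
        if (t == b) {  // last task: race against thieves with a CAS on top_
            bool won = top_.compare_exchange_strong(
                t, t + 1, std::memory_order_seq_cst, std::memory_order_relaxed);
            bottom_.store(b + 1, std::memory_order_relaxed);
            return won;
        }
        return true;
    }

    // Any thread. Returns false when empty or when the CAS loses a race.
    bool steal(T& out) {
        std::int64_t t = top_.load(std::memory_order_acquire);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        std::int64_t b = bottom_.load(std::memory_order_acquire);
        if (t >= b) return false;
        out = slots_[index(t)].load(std::memory_order_relaxed);
        return top_.compare_exchange_strong(
            t, t + 1, std::memory_order_seq_cst, std::memory_order_relaxed);
    }
};
```

In this sketch the owner pays for the fence on every `take`, even when no thief is active. That per-removal cost is what the abstract says can be avoided under a bounded TSO model.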