A comparison of parallel algorithms for connected components
SPAA '94 Proceedings of the sixth annual ACM symposium on Parallel algorithms and architectures
Cilk: an efficient multithreaded runtime system
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Thread scheduling for multiprogrammed multiprocessors
Proceedings of the tenth annual ACM symposium on Parallel algorithms and architectures
The implementation of the Cilk-5 multithreaded language
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Journal of Parallel and Distributed Computing
Non-blocking steal-half work queues
Proceedings of the twenty-first annual symposium on Principles of distributed computing
Dynamic circular work-stealing deque
Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures
X10: an object-oriented approach to non-uniform cluster computing
OOPSLA '05 Proceedings of the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
A dynamic-sized nonblocking work stealing deque
Distributed Computing - Special issue: DISC 04
A fast, parallel spanning tree algorithm for symmetric multiprocessors (SMPs)
Journal of Parallel and Distributed Computing
Parallel garbage collection for shared memory multiprocessors
JVM'01 Proceedings of the 2001 Symposium on JavaTM Virtual Machine Research and Technology Symposium - Volume 1
A new approach to parallelising tracing algorithms
Proceedings of the 2009 international symposium on Memory management
Experience with Model Checking Linearizability
Proceedings of the 16th International SPIN Workshop on Model Checking Software
The design of a task parallel library
Proceedings of the 24th ACM SIGPLAN conference on Object oriented programming systems languages and applications
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Lazy binary-splitting: a run-time adaptive work-stealing scheduler
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
On the verification problem for weak memory models
Proceedings of the 37th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages
An adaptive task creation strategy for work-stealing scheduling
Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
PHALANX: parallel checking of expressive heap assertions
Proceedings of the 2010 international symposium on Memory management
Simplifying concurrent algorithms by exploiting hardware transactional memory
Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
Laws of order: expensive synchronization in concurrent algorithms cannot be eliminated
Proceedings of the 38th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Synthesizing concurrent schedulers for irregular algorithms
Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Automatic inference of memory fences
Proceedings of the 2010 Conference on Formal Methods in Computer-Aided Design
Virtual base station pool: towards a wireless network cloud for radio access networks
Proceedings of the 8th ACM International Conference on Computing Frontiers
Incorrect systems: it's not the problem, it's the solution
Proceedings of the 49th Annual Design Automation Conference
Dynamic synthesis for relaxed memory models
Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
CATS: cache aware task-stealing based on online profiling in multi-socket multi-core architectures
Proceedings of the 26th ACM international conference on Supercomputing
LIBKOMP, an efficient openMP runtime system for both fork-join and data flow paradigms
IWOMP'12 Proceedings of the 8th international conference on OpenMP in a Heterogeneous World
Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
Performance, scalability, and semantics of concurrent FIFO queues
ICA3PP'12 Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part I
Library abstraction for C/C++ concurrency
POPL '13 Proceedings of the 40th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Quantitative relaxation of concurrent data structures
POPL '13 Proceedings of the 40th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Correct and efficient work-stealing for weak memory models
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Scheduling parallel programs by work stealing with private deques
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Freeze after writing: quasi-deterministic parallel programming with LVars
Proceedings of the 41st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages
Energy-efficient work-stealing language runtimes
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Fence-free work stealing on bounded TSO processors
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
DWS: Demand-aware Work-Stealing in Multi-programmed Multi-core Architectures
Proceedings of Programming Models and Applications on Multicores and Manycores
Hi-index | 0.00 |
Load balancing is a technique which allows efficient parallelization of irregular workloads, and a key component of many applications and parallelizing runtimes. Work-stealing is a popular technique for implementing load balancing, where each parallel thread maintains its own work set of items and occasionally steals items from the sets of other threads. The conventional semantics of work stealing guarantee that each inserted task is eventually extracted exactly once. However, correctness of a wide class of applications allows for relaxed semantics, because either: i) the application already explicitly checks that no work is repeated or ii) the application can tolerate repeated work. In this paper, we introduce idempotent work tealing, and present several new algorithms that exploit the relaxed semantics to deliver better performance. The semantics of the new algorithms guarantee that each inserted task is eventually extracted at least once-instead of exactly once. On mainstream processors, algorithms for conventional work stealing require special atomic instructions or store-load memory ordering fence instructions in the owner's critical path operations. In general, these instructions are substantially slower than regular memory access instructions. By exploiting the relaxed semantics, our algorithms avoid these instructions in the owner's operations. We evaluated our algorithms using common graph problems and micro-benchmarks and compared them to well-known conventional work stealing algorithms, the THE Cilk and Chase-Lev algorithms. We found that our best algorithm (with LIFO extraction) outperforms existing algorithms in nearly all cases, and often by significant margins.