This paper deals with improving the performance of fine-grained task parallelism. It is often either cumbersome or impossible to increase the grain size of such programs. Increasing core counts exacerbates the problem; a program that appears coarse-grained on eight cores may well look much more fine-grained on sixty-four. In this paper we present the direct task stack, a novel work-stealing algorithm with unusually low overheads, both for creating tasks and for stealing. We compare the performance of our scheduler to Cilk++, the icc implementation of OpenMP 3.0, and the Intel TBB library on an eight-core, dual-socket Opteron machine. We also analyze why our techniques achieve consistent speedups over the other systems, ranging from 2-3x on many fine-grained workloads to over 50x in extreme cases, and show quantitatively how each of the techniques we use contributes to the improved performance.
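For readers unfamiliar with work stealing, the following is a minimal sketch of the general idea only. It is not the direct task stack presented in the paper, and it does not reproduce any of the paper's low-overhead techniques. The class name `TaskStack` and its methods are hypothetical, and the lock is a deliberate simplification: each worker owns a pool of tasks, takes its own most recently spawned task, and an idle worker (a thief) takes the oldest task from another worker's pool.

```python
from collections import deque
import threading


class TaskStack:
    """Hypothetical per-worker task pool, for illustration only."""

    def __init__(self):
        self._tasks = deque()
        # Assumption: one lock guards every operation. This keeps the sketch
        # simple but is not how a low-overhead scheduler would synchronize.
        self._lock = threading.Lock()

    def spawn(self, task):
        """Owner adds a newly created task at the newest end."""
        with self._lock:
            self._tasks.append(task)

    def pop(self):
        """Owner takes its most recently spawned task, or None if empty."""
        with self._lock:
            return self._tasks.pop() if self._tasks else None

    def steal(self):
        """Thief takes the oldest task, or None if empty."""
        with self._lock:
            return self._tasks.popleft() if self._tasks else None


# Single-threaded walk-through of the two access patterns.
victim = TaskStack()
for name in ["t0", "t1", "t2"]:
    victim.spawn(name)

stolen = victim.steal()  # a thief gets the oldest task
local = victim.pop()     # the owner gets the newest task
```

In this sketch every spawn and every steal pays for a lock acquisition; costs of that kind on task creation and stealing are the overheads the abstract refers to.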