An efficient scheduler is important for task parallelism. It should provide a scalable, dynamic load-balancing mechanism among CPU cores. To meet this requirement, most runtime systems for task parallelism use work stealing as their scheduling strategy. Work-stealing schedulers typically steal work at random. This strategy considers neither hardware-specific knowledge, such as the memory hierarchy, nor application-specific knowledge, such as cache usage. To execute tasks more efficiently, work-stealing schedulers should take such knowledge into account. To this end, we propose an API for customizing scheduling strategies that takes hardware- and application-specific knowledge into account while preserving the desirable properties of work stealing. This paper describes the design of the proposed API. Specifically, the API provides mechanisms for attaching scheduling hints to tasks and for implementing user-defined work-stealing functions. These mechanisms enable programmers to implement a work-stealing strategy optimized for their applications. This paper also presents preliminary evaluation results for the proposed API. The performance of a kernel of the STREAM microbenchmark improved by 58.8% with a work-stealing strategy that exploits data cached by the previous iteration. The performance of matrix multiplication improved by 18.2% on 32 AMD cores with a work-stealing strategy that tries to steal as coarse-grained a task as possible.
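As a rough illustration of the two mechanisms named in the abstract, per-task scheduling hints and user-defined steal functions, the following Python sketch simulates work stealing sequentially. It is not the proposed API: the names `Task.hint`, `Worker`, `random_steal`, `coarsest_steal`, and `next_task` are hypothetical, the hint is assumed to be a number estimating task granularity, and a real runtime would use concurrent deques rather than this single-threaded model.

```python
import random
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List, Optional


@dataclass
class Task:
    name: str
    hint: int = 0  # hypothetical scheduling hint: larger means coarser-grained


class Worker:
    """One worker with its own double-ended task queue."""

    def __init__(self, wid: int) -> None:
        self.wid = wid
        self.tasks: Deque[Task] = deque()

    def push(self, task: Task) -> None:
        self.tasks.append(task)

    def pop(self) -> Optional[Task]:
        # The owner takes the most recently pushed task (LIFO end).
        return self.tasks.pop() if self.tasks else None


# A steal function receives the thief and all workers, and returns a task or None.
StealFn = Callable[[Worker, List[Worker]], Optional[Task]]


def random_steal(thief: Worker, workers: List[Worker]) -> Optional[Task]:
    """Conventional strategy: pick a random non-empty victim, take its oldest task."""
    victims = [w for w in workers if w is not thief and w.tasks]
    if not victims:
        return None
    return random.choice(victims).tasks.popleft()


def coarsest_steal(thief: Worker, workers: List[Worker]) -> Optional[Task]:
    """User-defined strategy: steal from the victim whose oldest task has the largest hint."""
    best: Optional[Worker] = None
    for w in workers:
        if w is thief or not w.tasks:
            continue
        if best is None or w.tasks[0].hint > best.tasks[0].hint:
            best = w
    return best.tasks.popleft() if best is not None else None


def next_task(worker: Worker, workers: List[Worker],
              steal: StealFn = random_steal) -> Optional[Task]:
    """Run local work first; fall back to the configured steal function."""
    task = worker.pop()
    if task is not None:
        return task
    return steal(worker, workers)
```

Passing `coarsest_steal` to `next_task` in place of the default swaps the victim-selection policy without touching the local push and pop path, which is the kind of separation a customizable work-stealing interface would need to preserve.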