The recent addition of task parallelism to the OpenMP shared memory API allows programmers to express concurrency at a high level of abstraction and places the burden of scheduling parallel execution on the OpenMP run-time system. Efficient scheduling of tasks on modern multi-socket multicore shared memory systems requires careful consideration of an increasingly complex memory hierarchy, including shared caches and non-uniform memory access (NUMA) characteristics. In order to evaluate scheduling strategies, we extended the open source Qthreads threading library to implement different scheduler designs, accepting OpenMP programs through the ROSE compiler. Our comprehensive performance study of diverse OpenMP task-parallel benchmarks compares seven different task-parallel run-time scheduler implementations on an Intel Nehalem multi-socket multicore system: our proposed hierarchical work-stealing scheduler, a per-core work-stealing scheduler, a centralized scheduler, and LIFO and FIFO versions of the Qthreads round-robin scheduler. In addition, we compare our results against the Intel and GNU OpenMP implementations.

Our hierarchical scheduling strategy leverages different scheduling methods at different levels of the hierarchy. By allowing one thread to steal work on behalf of all of the threads within a single chip that share a cache, the scheduler limits the number of costly remote steals. For cores on the same chip, a shared LIFO queue allows exploitation of cache locality between sibling tasks as well as between a parent task and its newly created child tasks. In the performance evaluation, our Qthreads hierarchical scheduler is competitive on all benchmarks tested. On five of the seven benchmarks, it demonstrates speedup and absolute performance superior to both the Intel and GNU OpenMP run-time systems. Our run-time also demonstrates similar performance benefits on AMD Magny Cours and SGI Altix systems, enabling several benchmarks to successfully scale to 192 CPUs of an SGI Altix.
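
The hierarchical strategy described above can be illustrated with a small, single-threaded simulation. This is only a sketch under stated assumptions, not the Qthreads implementation: the `Shepherd` class, the `run` function, the policy of stealing from the oldest end of the longest remote queue, the batch size of one task per core, and the toy workload (a task of depth d spawns two tasks of depth d - 1) are all hypothetical choices made for illustration.

```python
from collections import deque


class Shepherd:
    """Hypothetical model of one chip: cores sharing a cache and one LIFO queue."""

    def __init__(self, ident, workers):
        self.ident = ident
        self.workers = workers    # number of cores sharing this queue
        self.queue = deque()      # right end holds the newest task
        self.remote_steals = 0

    def push(self, task):
        self.queue.append(task)

    def pop_local(self):
        # LIFO order: run the newest task first, while its parent's and
        # siblings' data are likely still in the shared cache.
        return self.queue.pop() if self.queue else None

    def steal_from(self, victim):
        # Assumption: one steal fetches up to one task per local core,
        # taken from the oldest end of the victim's queue.
        count = min(self.workers, len(victim.queue))
        for _ in range(count):
            self.queue.append(victim.queue.popleft())
        if count:
            self.remote_steals += 1
        return count


def run(shepherds, root_tasks):
    """Simulate execution round by round; return tasks executed per shepherd."""
    for task in root_tasks:
        shepherds[0].push(task)
    executed = [0] * len(shepherds)
    progress = True
    while progress:
        progress = False
        for shep in shepherds:
            for _ in range(shep.workers):
                task = shep.pop_local()
                if task is None:
                    # Local queue is empty: steal once for the whole chip.
                    victims = [v for v in shepherds if v is not shep and v.queue]
                    if not victims:
                        break
                    busiest = max(victims, key=lambda v: len(v.queue))
                    if not shep.steal_from(busiest):
                        break
                    task = shep.pop_local()
                executed[shep.ident] += 1
                progress = True
                if task > 0:
                    # Toy workload: a task of depth d spawns two children.
                    shep.push(task - 1)
                    shep.push(task - 1)
    return executed
```

Calling `run([Shepherd(0, 4), Shepherd(1, 4)], [4])` executes the full binary task tree across two simulated chips. Comparing the `remote_steals` counters with the number of tasks executed shows the intended effect: each remote steal is amortized over a batch of tasks rather than paid per core.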