Scheduling task parallelism on multi-socket multicore systems

Authors:
Stephen L. Olivier;Allan K. Porterfield;Kyle B. Wheeler;Jan F. Prins
Affiliations:
University of North Carolina at Chapel Hill, Chapel Hill, NC;Renaissance Computing Institute (RENCI), Europa Drive, Suite, Chapel Hill, NC;Scalable System Software, Sandia National Laboratories, Albuquerque, NM;University of North Carolina at Chapel Hill, Chapel Hill, NC
Venue:
Proceedings of the 1st International Workshop on Runtime and Operating Systems for Supercomputers
Year:
2011

References (19)

Exploiting heterogeneous parallelism on a multithreaded multiprocessor
ICS '92 Proceedings of the 6th international conference on Supercomputing

Cilk: an efficient multithreaded runtime system
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming

The implementation of the Cilk-5 multithreaded language
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation

Provably efficient scheduling for languages with fine-grained parallelism
Journal of the ACM (JACM)

ATLAS: an infrastructure for global computing
EW 7 Proceedings of the 7th workshop on ACM SIGOPS European workshop: Systems support for worldwide applications

Satin: Efficient Parallel Divide-and-Conquer in Java
Euro-Par '00 Proceedings from the 6th International Euro-Par Conference on Parallel Processing

X10: an object-oriented approach to non-uniform cluster computing
OOPSLA '05 Proceedings of the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications

Scheduling threads for constructive cache sharing on CMPs
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures

KAAPI: A thread scheduling runtime system for data flow computations on cluster of multi-processors
Proceedings of the 2007 international workshop on Parallel symbolic computation

Parallel Programmability and the Chapel Language
International Journal of High Performance Computing Applications

Support for OpenMP tasks in Nanos v4
CASCON '07 Proceedings of the 2007 conference of the center for advanced studies on Collaborative research

Scheduling multithreaded computations by work stealing
SFCS '94 Proceedings of the 35th Annual Symposium on Foundations of Computer Science

An adaptive cut-off for task parallelism
Proceedings of the 2008 ACM/IEEE conference on Supercomputing

The Design of OpenMP Tasks
IEEE Transactions on Parallel and Distributed Systems

The design of a task parallel library
Proceedings of the 24th ACM SIGPLAN conference on Object oriented programming systems languages and applications

Barcelona OpenMP Tasks Suite: A Set of Benchmarks Targeting the Exploitation of Task Parallelism in OpenMP
ICPP '09 Proceedings of the 2009 International Conference on Parallel Processing

Evaluation of OpenMP task scheduling strategies
IWOMP'08 Proceedings of the 4th international conference on OpenMP in a new era of parallelism

Hierarchical work-stealing
EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I

A ROSE-Based OpenMP 3.0 research compiler supporting multiple runtime libraries
IWOMP'10 Proceedings of the 6th international conference on Beyond Loop Level Parallelism in OpenMP: accelerators, Tasking and more

Cited by (13)

OpenMP task scheduling strategies for multicore NUMA systems
International Journal of High Performance Computing Applications

CATS: cache aware task-stealing based on online profiling in multi-socket multi-core architectures
Proceedings of the 26th ACM international conference on Supercomputing

Global Futures: A Multithreaded Execution Model for Global Arrays-based Applications
CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)

LIBKOMP, an efficient openMP runtime system for both fork-join and data flow paradigms
IWOMP'12 Proceedings of the 8th international conference on OpenMP in a Heterogeneous World

Assessing OpenMP tasking implementations on NUMA architectures
IWOMP'12 Proceedings of the 8th international conference on OpenMP in a Heterogeneous World

Characterizing and mitigating work time inflation in task parallel programs
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Task-parallel programming on NUMA architectures
Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing

Task scheduling on manycore processors with home caches
Euro-Par'12 Proceedings of the 18th international conference on Parallel processing workshops

Design and implementation of a customizable work stealing scheduler
Proceedings of the 3rd International Workshop on Runtime and Operating Systems for Supercomputers

An early prototype of an autonomic performance environment for exascale
Proceedings of the 3rd International Workshop on Runtime and Operating Systems for Supercomputers

A synthetic task model for HPC-grade optical network performance evaluation
IA^3 '13 Proceedings of the 3rd Workshop on Irregular Applications: Architectures and Algorithms

A topology-aware load balancing algorithm for clustered hierarchical multi-core machines
Future Generation Computer Systems

Characterizing and mitigating work time inflation in task parallel programs
Scientific Programming - Selected Papers from Super Computing 2012

Abstract

The recent addition of task parallelism to the OpenMP shared memory API allows programmers to express concurrency at a high level of abstraction and places the burden of scheduling parallel execution on the OpenMP run time system. This is a welcome development for scientific computing as supercomputer nodes grow "fatter" with multicore and manycore processors. But efficient scheduling of tasks on modern multi-socket multicore shared memory systems requires careful consideration of an increasingly complex memory hierarchy, including shared caches and NUMA characteristics. In this paper, we propose a hierarchical scheduling strategy that leverages different methods at different levels of the hierarchy. By allowing one thread to steal work on behalf of all of the threads within a single chip that share a cache, our scheduler limits the number of costly remote steals. For cores on the same chip, a shared LIFO queue allows exploitation of cache locality between sibling tasks as well as between a parent task and its newly created child tasks. We extended the open-source Qthreads threading library to implement our scheduler, accepting OpenMP programs through the ROSE compiler. We also present a comprehensive performance study of diverse OpenMP task parallel benchmarks, comparing seven different task parallel run time scheduler implementations on current generation multi-socket multicore systems: our hierarchical work stealing scheduler, a fully-distributed work stealing scheduler, a centralized scheduler, and LIFO and FIFO versions of the original Qthreads fully-distributed scheduler. In addition, we compare our results against OpenMP implementations from Intel and GCC. Hierarchical scheduling in Qthreads is competitive on all benchmarks. On several benchmarks, hierarchical scheduling in Qthreads demonstrates speedup and absolute performance superior to both the Intel and GCC OpenMP run time systems.
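The two-level idea in the abstract can be illustrated with a small, single-threaded simulation. This is only a sketch under stated assumptions, not the Qthreads implementation: the names `ChipQueue`, `steal_from`, and `run` are hypothetical, and the choices to steal one task per local worker, to steal from the oldest end of the victim's queue, and to pick the fullest queue as victim are assumptions made for illustration (the abstract does not specify them). Tasks here are plain integers that spawn no children.

```python
from collections import deque


class ChipQueue:
    """One shared task queue per chip; all workers on the chip use it."""

    def __init__(self, ident, workers):
        self.ident = ident
        self.workers = workers      # number of cores sharing this chip's cache
        self.queue = deque()        # right end holds the newest task
        self.remote_steals = 0      # steal operations that crossed chips

    def pop_local(self):
        # LIFO for local workers: run the newest task first.
        return self.queue.pop() if self.queue else None

    def steal_from(self, victim):
        # One worker steals on behalf of the whole chip.
        # Assumption: take up to one task per local worker, oldest first.
        batch = []
        while victim.queue and len(batch) < self.workers:
            batch.append(victim.queue.popleft())
        if batch:
            self.remote_steals += 1
            self.queue.extend(batch)
        return len(batch)


def run(chips, initial_tasks):
    """Round-based simulation; returns the tasks executed on each chip."""
    chips[0].queue.extend(initial_tasks)
    executed = {c.ident: [] for c in chips}
    while any(c.queue for c in chips):
        for chip in chips:
            if not chip.queue:
                others = [v for v in chips if v is not chip]
                if others:
                    # Assumption: the fullest remote queue is the victim.
                    victim = max(others, key=lambda v: len(v.queue))
                    chip.steal_from(victim)
            # Each worker on the chip runs at most one task per round.
            for _ in range(chip.workers):
                task = chip.pop_local()
                if task is None:
                    break
                executed[chip.ident].append(task)
    return executed
```

With two chips of four workers each and 16 initial tasks, the second chip ends up executing eight tasks after only two cross-chip steal operations, whereas a scheme in which each worker steals one task at a time would need one remote steal per stolen task.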