Scheduling dynamic OpenMP applications over multicore architectures

Authors:
François Broquedis;François Diakhaté;Samuel Thibault;Olivier Aumage;Raymond Namyst;Pierre-André Wacrenier
Affiliations:
INRIA Futurs, LaBRI, Université Bordeaux 1, France;INRIA Futurs, LaBRI, Université Bordeaux 1, France;INRIA Futurs, LaBRI, Université Bordeaux 1, France;INRIA Futurs, LaBRI, Université Bordeaux 1, France;INRIA Futurs, LaBRI, Université Bordeaux 1, France;INRIA Futurs, LaBRI, Université Bordeaux 1, France
Venue:
IWOMP'08 Proceedings of the 4th international conference on OpenMP in a new era of parallelism
Year:
2008

Citing 14
Cited 3

The implementation of the Cilk-5 multithreaded language

PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Thread fork/join techniques for multi-level parallelism exploitation in NUMA multiprocessors

ICS '99 Proceedings of the 13th international conference on Supercomputing
Compiler and Runtime Support for Running OpenMP Programs on Pentium- and Itanium-Architectures

HIPS '03 Proceedings of the Eighth International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS'03)
Multi-level partition of unity implicits

ACM SIGGRAPH 2003 Papers
Automatic thread distribution for nested parallelism in OpenMP

Proceedings of the 19th annual international conference on Supercomputing
Practical Compiler Techniques on Efficient Multithreaded Code Generation for OpenMP Programs

The Computer Journal
A compiler for exploiting nested parallelism in OpenMP programs

Parallel Computing - OpenMP
Load balancing and OpenMP implementation of nested parallelism

Parallel Computing - OpenMP
Nested OpenMP for efficient computation of 3D critical points in multi-block CFD datasets

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Nested parallelization with OpenMP

International Journal of Parallel Programming
An introduction to Balder: an OpenMP run-time library for clusters of SMPs

IWOMP'05/IWOMP'06 Proceedings of the 2005 and 2006 international conference on OpenMP shared memory parallel programming
Hierarchical multithreading: programming model and system software

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Runtime adjustment of parallel nested loops

WOMPAT'04 Proceedings of the 5th international conference on OpenMP Applications and Tools: shared Memory Parallel Programming with OpenMP
Nested parallelism in the OMPi OpenMP/C compiler

Euro-Par'07 Proceedings of the 13th international Euro-Par conference on Parallel Processing

Dynamic Task and Data Placement over NUMA Architectures: An OpenMP Runtime Perspective

IWOMP '09 Proceedings of the 5th International Workshop on OpenMP: Evolving OpenMP in an Age of Extreme Parallelism
Exploiting thread-data affinity in OpenMP with data access patterns

Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
How OpenMP applications get more benefit from many-core era

IWOMP'10 Proceedings of the 6th international conference on Beyond Loop Level Parallelism in OpenMP: accelerators, Tasking and more

Abstract

Approaching the theoretical performance of hierarchical multicore machines requires a very careful distribution of threads and data among the underlying non-uniform architecture in order to minimize cache misses and NUMA penalties. While it is acknowledged that OpenMP can enhance the quality of thread scheduling on such architectures in a portable way, by transmitting precious information about the affinities between threads and data to the underlying runtime system, most OpenMP runtime systems are actually unable to efficiently support highly irregular, massively parallel applications on NUMA machines. In this paper, we present a thread scheduling policy suited to the execution of OpenMP programs featuring irregular and massive nested parallelism over hierarchical architectures. Our policy enforces a distribution of threads that maximizes the proximity of threads belonging to the same parallel region, and uses a NUMA-aware work stealing strategy when load balancing is needed. It has been developed as a plug-in to the forestGOMP OpenMP platform [TBG+07]. We demonstrate the efficiency of our approach with a highly irregular recursive OpenMP program resulting from the generic parallelization of a surface reconstruction application. We achieve a speedup of 14 on a 16-core machine with no application-level optimization.
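
The idea of NUMA-aware work stealing mentioned in the abstract can be illustrated with a minimal sketch. This is not the forestGOMP policy itself, only a toy model under stated assumptions: the topology (4 NUMA nodes of 4 cores each), the three-level distance function, and the names `node_of`, `distance`, `steal_order`, and `steal` are all hypothetical and chosen for illustration. An idle core looks for work on the closest cores first, so that stolen tasks stay near the data they are likely to touch.

```python
from collections import deque

# Hypothetical topology: 4 NUMA nodes x 4 cores (an assumption for illustration,
# not the machine layout described in the paper).
CORES_PER_NODE = 4
NUM_NODES = 4
NUM_CORES = CORES_PER_NODE * NUM_NODES


def node_of(core):
    """Return the NUMA node that hosts a given core."""
    return core // CORES_PER_NODE


def distance(a, b):
    """Toy topological distance: 0 = same core, 1 = same node, 2 = remote node."""
    if a == b:
        return 0
    return 1 if node_of(a) == node_of(b) else 2


def steal_order(thief):
    """List potential victims, closest first (ties broken by core index)."""
    victims = [c for c in range(NUM_CORES) if c != thief]
    return sorted(victims, key=lambda c: (distance(thief, c), c))


def steal(thief, queues):
    """Take one task from the nearest non-empty queue, or return None."""
    for victim in steal_order(thief):
        if queues[victim]:
            # Owners are assumed to pop() from the right end of their deque,
            # so the thief takes from the left end.
            return victim, queues[victim].popleft()
    return None


queues = [deque() for _ in range(NUM_CORES)]
queues[2].extend(["a", "b"])   # work queued on the thief's own node
queues[9].extend(["c"])        # work queued on a remote node
result = steal(0, queues)      # core 0 is idle and tries to steal
```

In this sketch, core 0 exhausts the queues of its own node before it ever visits core 9 on a remote node. A real runtime would also have to weigh where a task's data lives and would need synchronized queues, both of which are left out here.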