Approaching the theoretical performance of hierarchical multicore machines requires a very careful distribution of threads and data among the underlying non-uniform architecture in order to minimize cache misses and NUMA penalties. While it is acknowledged that OpenMP can enhance the quality of thread scheduling on such architectures in a portable way, by transmitting precious information about the affinities between threads and data to the underlying runtime system, most OpenMP runtime systems are actually unable to efficiently support highly irregular, massively parallel applications on NUMA machines. In this paper, we present a thread scheduling policy suited to the execution of OpenMP programs featuring irregular and massive nested parallelism over hierarchical architectures. Our policy enforces a distribution of threads that maximizes the proximity of threads belonging to the same parallel region, and uses a NUMA-aware work stealing strategy when load balancing is needed. It has been developed as a plug-in to the forestGOMP OpenMP platform [TBG+07]. We demonstrate the efficiency of our approach with a highly irregular recursive OpenMP program resulting from the generic parallelization of a surface reconstruction application. We achieve a speedup of 14 on a 16-core machine with no application-level optimization.
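The abstract does not spell out how the NUMA-aware work stealing strategy picks its victims. As a rough illustration of the general idea only, and not of the forestGOMP plug-in itself, the sketch below orders steal candidates by topological proximity, so that an idle core first looks at queues on its own NUMA node before going remote. The 4-node by 4-core layout and the names `numa_node`, `steal_order`, and `steal` are hypothetical choices made for this example.

```python
from collections import deque

# Hypothetical topology (an assumption for illustration):
# 16 cores grouped into 4 NUMA nodes of 4 cores each.
CORES_PER_NODE = 4
NUM_NODES = 4


def numa_node(core):
    """Map a core index to the NUMA node that hosts it."""
    return core // CORES_PER_NODE


def steal_order(thief, num_cores=CORES_PER_NODE * NUM_NODES):
    """List candidate victims: cores on the thief's node first, remote cores after."""
    others = [c for c in range(num_cores) if c != thief]
    # sorted() is stable, so cores keep their index order within each group.
    return sorted(others, key=lambda c: numa_node(c) != numa_node(thief))


def steal(thief, queues):
    """Take one task from the closest non-empty queue.

    Returns (victim, task), or None when every other queue is empty.
    """
    for victim in steal_order(thief, len(queues)):
        if queues[victim]:
            # Take the oldest task, leaving the newest ones to the queue's owner.
            return victim, queues[victim].popleft()
    return None
```

With one task queued on core 2 and another on core 9, `steal(0, queues)` returns the task from core 2 first, because core 2 shares a NUMA node with core 0 under the assumed layout. A real runtime would also have to weigh where the stolen thread's data lives, which this sketch leaves out.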