Automatic thread distribution for nested parallelism in OpenMP

Authors:
Alejandro Duran; Marc Gonzàlez; Julita Corbalán
Affiliations:
University of Catalonia, Despatx, Barcelona, Spain; University of Catalonia, Despatx, Barcelona, Spain; University of Catalonia, Despatx, Barcelona, Spain
Venue:
Proceedings of the 19th annual international conference on Supercomputing
Year:
2005

Abstract

OpenMP is becoming the standard programming model for shared-memory parallel architectures. One of its most interesting features is its support for nested parallelism. Previous research and parallelization experience have shown the benefits of using nested parallelism as an alternative to combining several programming models, such as MPI and OpenMP. However, all of this work relies on manually defining an appropriate distribution of the available threads across the different levels of parallelism. Some proposals have been made to extend the OpenMP language so that programmers can specify the thread distribution.

This paper proposes a mechanism to dynamically compute the most appropriate thread distribution strategy. The mechanism gathers information at runtime to derive the structure of the nested parallelism. This information is used to determine how the overall computation is distributed among the parallel branches in the outermost level of parallelism, which is constant in this work. Threads in the innermost level of parallelism are then distributed according to that measured distribution.

The proposed mechanism is evaluated in two environments: a research environment, the Nanos OpenMP research platform, and a commercial environment, the IBM XL runtime library. The performance results validate the mechanism in both environments and show the importance of selecting the proper amount of parallelism in the outer level.
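The abstract does not spell out how the inner-level threads are assigned, so the following is only a minimal sketch of one plausible reading: give each outer branch a share of the threads proportional to the work measured for it at runtime. The function name `distribute_threads`, the use of a per-branch work estimate (for example, elapsed time), the one-thread-per-branch minimum, and the largest-remainder rounding are all assumptions made for illustration, not the paper's algorithm.

```python
def distribute_threads(branch_work, total_threads):
    """Split total_threads across outer branches in proportion to measured work.

    Hypothetical helper (not from the paper): every outer branch keeps at
    least one thread, and the spare threads are assigned in proportion to
    each branch's measured work using largest-remainder rounding.
    """
    n = len(branch_work)
    if n == 0:
        return []
    if total_threads < n:
        raise ValueError("need at least one thread per outer branch")

    total_work = sum(branch_work)
    if total_work <= 0:
        # No measurements yet: fall back to an even split.
        base, extra = divmod(total_threads, n)
        return [base + (1 if i < extra else 0) for i in range(n)]

    spare = total_threads - n  # threads left after one per branch
    ideal = [spare * w / total_work for w in branch_work]
    alloc = [1 + int(x) for x in ideal]

    # Hand the remaining threads to the branches with the largest
    # fractional share.
    leftover = total_threads - sum(alloc)
    order = sorted(range(n), key=lambda i: ideal[i] - int(ideal[i]), reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


if __name__ == "__main__":
    # Three outer branches whose measured work is 10, 30 and 60 units,
    # sharing 16 threads.
    print(distribute_threads([10, 30, 60], 16))  # prints [2, 5, 9]
```

In an OpenMP program, each resulting count could be passed to the `num_threads` clause of the inner parallel region executed by the corresponding outer branch. A real runtime would also need to re-measure and rebalance as the workload changes, which this sketch leaves out.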