Scalability Evaluation of Barrier Algorithms for OpenMP

Authors:
Ramachandra Nanjegowda; Oscar Hernandez; Barbara Chapman; Haoqiang H. Jin
Affiliations:
Computer Science Department, University of Houston, Houston, USA 77004 (Nanjegowda, Hernandez, Chapman); NASA Ames Research Center, Moffett Field, CA 94035 (Jin)
Venue:
IWOMP '09 Proceedings of the 5th International Workshop on OpenMP: Evolving OpenMP in an Age of Extreme Parallelism
Year:
2009

Abstract

OpenMP relies heavily on barrier synchronization to coordinate the work of threads that are performing the computations in a parallel region. A good implementation of barriers is thus an important part of any implementation of this API. As the number of cores in shared and distributed shared memory machines continues to grow, the quality of the barrier implementation becomes critical for application scalability. There are a number of known algorithms for providing barriers in software. In this paper, we consider some of the most widely used approaches for implementing barriers on large-scale shared-memory multiprocessor systems: a "blocking" implementation that de-schedules a waiting thread, a "centralized" busy-wait implementation, and three forms of distributed busy-wait implementations. We have implemented the barrier algorithms in the runtime library associated with a research compiler, OpenUH. We first compare the impact of these algorithms on the overheads incurred for OpenMP constructs that involve a barrier, whether explicit or implicit. We then show how the different barrier implementations influence the performance of two different OpenMP application codes.
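
To make the "centralized" busy-wait approach concrete, the following is a minimal sketch of a sense-reversing centralized barrier, one common textbook formulation of that idea. It is not taken from the paper or from the OpenUH runtime: it uses POSIX threads and C11 atomics rather than an OpenMP runtime, and every name in it (central_barrier_t, central_barrier_wait, run_barrier_demo, MAX_THREADS) is hypothetical.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_THREADS 64  /* assumed upper bound for this sketch */

/* Hypothetical centralized barrier: one shared counter, one shared flag. */
typedef struct {
    atomic_int  count;     /* threads still expected in the current episode */
    atomic_bool sense;     /* flipped by the last thread to arrive */
    int         nthreads;
} central_barrier_t;

static void central_barrier_init(central_barrier_t *b, int nthreads) {
    atomic_init(&b->count, nthreads);
    atomic_init(&b->sense, false);
    b->nthreads = nthreads;
}

/* Each thread owns a local_sense variable that starts out false. */
static void central_barrier_wait(central_barrier_t *b, bool *local_sense) {
    *local_sense = !*local_sense;
    if (atomic_fetch_sub(&b->count, 1) == 1) {
        /* Last arriver: reset the counter first, then release the others. */
        atomic_store(&b->count, b->nthreads);
        atomic_store(&b->sense, *local_sense);
    } else {
        /* Busy wait on the shared flag. The yield is an assumption of this
           sketch so that it also makes progress when threads outnumber cores. */
        while (atomic_load(&b->sense) != *local_sense)
            sched_yield();
    }
}

typedef struct {
    central_barrier_t *barrier;
    atomic_int *progress;    /* progress[i] = last round thread i announced */
    atomic_int *violations;  /* count of observed ordering violations */
    int tid, nthreads, rounds;
} worker_arg_t;

static void *worker(void *p) {
    worker_arg_t *a = p;
    bool local_sense = false;
    for (int r = 1; r <= a->rounds; r++) {
        atomic_store(&a->progress[a->tid], r);
        central_barrier_wait(a->barrier, &local_sense);
        /* After the barrier, every thread must have announced round r. */
        for (int i = 0; i < a->nthreads; i++)
            if (atomic_load(&a->progress[i]) < r)
                atomic_fetch_add(a->violations, 1);
    }
    return NULL;
}

/* Runs the demo; returns the number of violations, or -1 on bad input
   or if a thread could not be created. */
int run_barrier_demo(int nthreads, int rounds) {
    if (nthreads < 1 || nthreads > MAX_THREADS || rounds < 0)
        return -1;

    central_barrier_t barrier;
    atomic_int progress[MAX_THREADS];
    atomic_int violations;
    pthread_t threads[MAX_THREADS];
    worker_arg_t args[MAX_THREADS];

    central_barrier_init(&barrier, nthreads);
    atomic_init(&violations, 0);
    for (int i = 0; i < nthreads; i++)
        atomic_init(&progress[i], 0);

    /* Fill in all arguments before starting any thread. */
    for (int i = 0; i < nthreads; i++)
        args[i] = (worker_arg_t){ &barrier, progress, &violations,
                                  i, nthreads, rounds };

    int started = 0;
    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&threads[i], NULL, worker, &args[i]) != 0)
            break;
        started++;
    }
    if (started != nthreads) {
        /* Shrink the expected count so already-started threads can finish. */
        barrier.nthreads = started;
        atomic_store(&barrier.count, started);
    }
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    return started == nthreads ? atomic_load(&violations) : -1;
}
```

Calling run_barrier_demo(4, 50) from a main function should return 0 if the barrier holds. Every waiting thread spins on the same shared flag and decrements the same shared counter, which is the kind of contention that distributed busy-wait schemes aim to spread out; a blocking implementation would instead de-schedule the waiting thread, for example by waiting on a condition variable.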