A performance analysis of local synchronization

Authors:
Julia Lipman;Quentin F. Stout
Affiliations:
University of Michigan, Ann Arbor, MI;University of Michigan, Ann Arbor, MI
Venue:
Proceedings of the eighteenth annual ACM symposium on Parallelism in algorithms and architectures
Year:
2006

Citing 6

Enumerative combinatorics

Enumerative combinatorics
Bounds on the speedup and efficiency of partial synchronization in parallel processing systems

Journal of the ACM (JACM)
Reducing synchronization overhead in parallel simulation

PADS '96 Proceedings of the tenth workshop on Parallel and distributed simulation
Eliminating barrier synchronization for compiler-parallelized codes on software DSMs

International Journal of Parallel Programming - Special issue on languages and compilers for parallel computing. Part I
On the Performance of Synchronized Programs in Distributed Networks with Random Processing Times and Transmission Delays

IEEE Transactions on Parallel and Distributed Systems
Stochastic Modeling of Scaled Parallel Programs

Proceedings of the 1994 International Conference on Parallel and Distributed Systems

Cited 2

The Impact of noise on the scaling of collectives: the nearest neighbor model

HiPC'07 Proceedings of the 14th international conference on High performance computing
On bottleneck analysis in stochastic stream processing

ACM Transactions on Design Automation of Electronic Systems (TODAES)


Abstract

Synchronization is often necessary in parallel computing, but it can create delays whenever the receiving processor is idle, waiting for the information to arrive. This is especially true for barrier, or global, synchronization, in which every processor must synchronize with every other processor. Nonetheless, barriers are the only form of synchronization explicitly supplied in MPI and OpenMP.

Many applications do not actually require global synchronization; local synchronization, in which a processor synchronizes only with those processors from which it has an incoming edge in some directed graph, is often adequate. However, the behavior of a system under local synchronization is more difficult to analyze, since processors do not all start tasks at the same time.

In this paper, we show that if the synchronization graph is a directed cycle and the task times are geometrically distributed with p = 0.5, the time it takes for a processor to complete a task, including synchronization time, approaches an exact limit of 2 + √2 as the number of processors in the cycle approaches infinity. Under global synchronization, however, the time is unbounded, increasing logarithmically with the number of processors. Similar results also apply for p ≠ 0.5.

We give a new proof of the constant upper bounds that apply when tasks are normally distributed and the synchronization graph is any graph of bounded degree. We also prove that for some power-law distributions on the tasks, there is no constant upper bound as the number of processors increases, even for the directed cycle. Finally, we show that constant upper bounds apply for some cases of a different synchronization model in which a processor waits for only a subset of its neighbors.
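
The contrast between local and global synchronization described in the abstract can be explored with a small Monte Carlo sketch. It is not taken from the paper. It assumes one plausible reading of the model: task times are geometric on {1, 2, ...} with success probability p, and on a directed cycle a processor starts its next task once both it and its single predecessor have finished the current one. The function names and parameters are hypothetical.

```python
import random


def geometric(p, rng):
    """Number of Bernoulli(p) trials up to and including the first success."""
    trials = 1
    while rng.random() >= p:
        trials += 1
    return trials


def local_sync_time_per_task(n, rounds, p=0.5, seed=0):
    """Average time per task on a directed cycle of n processors.

    Assumed recurrence (not stated in the abstract): processor i starts its
    next task when it and its predecessor i - 1 have both finished.
    """
    rng = random.Random(seed)
    finish = [0] * n
    for _ in range(rounds):
        # finish[i - 1] with i = 0 wraps to the last processor, closing the cycle.
        finish = [max(finish[i], finish[i - 1]) + geometric(p, rng)
                  for i in range(n)]
    return sum(finish) / (n * rounds)


def global_sync_time_per_task(n, rounds, p=0.5, seed=0):
    """Average time per task when every round ends with a barrier."""
    rng = random.Random(seed)
    total = 0
    for _ in range(rounds):
        # A barrier makes every processor wait for the slowest one.
        total += max(geometric(p, rng) for _ in range(n))
    return total / rounds


if __name__ == "__main__":
    for n in (16, 64, 256):
        print(n,
              round(local_sync_time_per_task(n, 400), 3),
              round(global_sync_time_per_task(n, 400), 3))
```

Under these assumptions, the local estimate should stay near or below 2 + √2 ≈ 3.414 as n grows, while the barrier estimate keeps increasing with n. A finite simulation only suggests the limiting behavior; it does not reproduce the paper's proof.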