Synchronization is often necessary in parallel computing, but it can create delays whenever the receiving processor is idle, waiting for the information to arrive. This is especially true for barrier, or global, synchronization, in which every processor must synchronize with every other processor. Nonetheless, barriers are the only form of synchronization explicitly supplied in MPI and OpenMP.

Many applications do not actually require global synchronization; local synchronization, in which a processor synchronizes only with those processors from which it has an incoming edge in some directed graph, is often adequate. However, the behavior of a system under local synchronization is more difficult to analyze, since processors do not all start tasks at the same time.

In this paper, we show that if the synchronization graph is a directed cycle and the task times are geometrically distributed with p = 0.5, the time it takes for a processor to complete a task, including synchronization time, approaches an exact limit of 2 + √2 as the number of processors in the cycle approaches infinity. Under global synchronization, however, the time is unbounded, increasing logarithmically with the number of processors. Similar results also apply for p ≠ 0.5.

We give a new proof of the constant upper bounds that apply when task times are normally distributed and the synchronization graph is any graph of bounded degree. We also prove that for some power-law distributions of the task times, there is no constant upper bound as the number of processors increases, even for the directed cycle. Finally, we show that constant upper bounds apply for some cases of a different synchronization model in which a processor waits for only a subset of its neighbors.
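
The contrast between local and global synchronization can be explored with a small simulation. The sketch below is illustrative only and is not taken from the paper. It assumes that under local synchronization on a directed cycle, processor i starts its next task once both it and its predecessor i-1 have finished the current one, and that task times are geometric on {1, 2, ...} with success probability p. The function names (`geometric`, `draw_task_times`, `time_per_task_local`, `time_per_task_global`) are hypothetical.

```python
import random


def geometric(rng, p):
    """Number of Bernoulli(p) trials up to and including the first success.

    Assumption: task times take values 1, 2, 3, ... (mean 1/p).
    """
    k = 1
    while rng.random() >= p:
        k += 1
    return k


def draw_task_times(n, rounds, p=0.5, seed=0):
    """times[k][i] is the duration of task k on processor i."""
    rng = random.Random(seed)
    return [[geometric(rng, p) for _ in range(n)] for _ in range(rounds)]


def time_per_task_local(times):
    """Average time per task under local synchronization on a directed cycle.

    Assumed recurrence: processor i starts task k+1 only after it and its
    predecessor i-1 (wrapping around the cycle) have both finished task k.
    """
    n = len(times[0])
    finish = [0] * n
    for row in times:
        # finish[i - 1] with i == 0 reads the last processor, closing the cycle.
        finish = [max(finish[i], finish[i - 1]) + row[i] for i in range(n)]
    return max(finish) / len(times)


def time_per_task_global(times):
    """Average time per task under a barrier: each round lasts as long as
    the slowest processor in that round."""
    return sum(max(row) for row in times) / len(times)


if __name__ == "__main__":
    for n in (4, 16, 64):
        times = draw_task_times(n=n, rounds=2000, p=0.5, seed=1)
        print(n, time_per_task_local(times), time_per_task_global(times))
```

Because both functions are fed the same task times, the local estimate can never exceed the global one on a given sample. Increasing `n` gives a rough empirical view of how the two quantities behave as the number of processors grows. A finite simulation only approximates the limiting behavior stated in the abstract.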