Four-Ary Tree-Based Barrier Synchronization for 2D Meshes without Nonmember Involvement

Authors:
Sangman Moh; Chansu Yu; Ben Lee; Hee Young Youn; Dongsoo Han; Dongman Lee
Affiliations:
Information and Communications Univ., Taejon, Korea; Information and Communications Univ., Taejon, Korea; Oregon State Univ., Corvallis; Sungkyunkwan Univ., Suwon, Korea; Information and Communications Univ., Taejon, Korea; Information and Communications Univ., Taejon, Korea
Venue:
IEEE Transactions on Computers - Special issue on the parallel architecture and compilation techniques conference
Year:
2001

Abstract

This paper proposes a Barrier Tree for Meshes (BTM) to minimize barrier synchronization latency on two-dimensional (2D) meshes. The BTM scheme has two distinguishing features. First, the synchronization tree is 4-ary. The synchronization latency of the BTM scheme is asymptotically $\Theta(\log_{4} n)$, whereas that of the fastest scheme previously reported in the literature is bounded between $\Omega(\log_{3} n)$ and $O(n^{1/2})$, where $n$ is the number of member nodes. Second, nonmember nodes are neither involved in the construction of a BTM nor active participants in the synchronization operations, which avoids interference among different process groups during synchronization. This both lowers the setup overhead and reduces the synchronization latency. The low setup overhead is particularly valuable for the dynamic process model provided in MPI-2. An extensive simulation study shows that, for meshes of up to $64 \times 64$ nodes, the BTM scheme achieves about 40 to 70 percent shorter synchronization latency and scales better than conventional schemes.
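
To get a feel for the asymptotic bounds quoted in the abstract, the following minimal sketch evaluates the three growth terms ($\log_{4} n$, $\log_{3} n$, and $n^{1/2}$) for a few mesh sizes, assuming every node of the mesh is a member. The function name `growth_terms` and the chosen mesh sizes are illustrative assumptions, not taken from the paper. The values are dimensionless growth terms with all constant factors ignored, so they are not predicted latencies and say nothing about the simulation results reported above.

```python
import math


def growth_terms(n):
    """Return the growth terms named in the abstract for n member nodes.

    Hypothetical helper for illustration only: constants hidden by the
    asymptotic notation are ignored, so these are not latency estimates.
    """
    return {
        "log4": math.log2(n) / 2,            # log_4 n, the BTM growth term
        "log3": math.log(n) / math.log(3),   # log_3 n, lower-bound term of the prior scheme
        "sqrt": math.sqrt(n),                # n^(1/2), upper-bound term of the prior scheme
    }


if __name__ == "__main__":
    # Assumption: the whole side x side mesh participates, so n = side * side.
    for side in (4, 16, 64):
        n = side * side
        t = growth_terms(n)
        print(
            f"{side}x{side} mesh (n={n}): "
            f"log4 n = {t['log4']:.2f}, "
            f"log3 n = {t['log3']:.2f}, "
            f"sqrt n = {t['sqrt']:.2f}"
        )
```

For a full $64 \times 64$ mesh ($n = 4096$), $\log_{4} n = 6$ while $n^{1/2} = 64$, which shows how far apart the logarithmic bound and the square-root upper bound sit as the mesh grows.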