Deadlock-Free Message Routing in Multiprocessor Interconnection Networks
IEEE Transactions on Computers
The fuzzy barrier: a mechanism for high speed synchronization of processors
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
A scalable implementation of barrier synchronization using an adaptive combining tree
International Journal of Parallel Programming
Fast barrier synchronization hardware
Proceedings of the 1990 ACM/IEEE conference on Supercomputing
Unicast-Based Multicast Communication in Wormhole-Routed Networks
IEEE Transactions on Parallel and Distributed Systems
Parallel programming with MPI
A Cost and Speed Model for k-ary n-Cube Wormhole Routers
IEEE Transactions on Parallel and Distributed Systems
Designing Tree-Based Barrier Synchronization on 2D Mesh Networks
IEEE Transactions on Parallel and Distributed Systems
Wormhole routing techniques for directly connected multicomputer systems
ACM Computing Surveys (CSUR)
Efficient techniques for nested and disjoint barrier synchronization
Journal of Parallel and Distributed Computing - Special issue on compilation and architectural support for parallel applications
High Performance Cluster Computing: Programming and Applications
High Performance Cluster Computing: Programming and Applications
High Performance Cluster Computing: Architectures and Systems
High Performance Cluster Computing: Architectures and Systems
Parallel Computer Architecture: A Hardware/Software Approach
Parallel Computer Architecture: A Hardware/Software Approach
Interconnection Networks: An Engineering Approach
Interconnection Networks: An Engineering Approach
Assessing the Performance of the New IBM SP2 Communication Subsystem
IEEE Parallel & Distributed Technology: Systems & Technology
Deadlock-Free Multicast Wormhole Routing in 2-D Mesh Multicomputers
IEEE Transactions on Parallel and Distributed Systems
Simulation Studies of Gigabit Ethernet Versus Myrinet Using Real Application Cores
CANPC '00 Proceedings of the 4th International Workshop on Network-Based Parallel Computing: Communication, Architecture, and Applications
Fast barrier synchronization in wormhole k-ary n-cube networks with multidestination worms
HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Tiered Algorithm for Distributed Process Quiescence and Termination Detection
IEEE Transactions on Parallel and Distributed Systems
Exploiting single-assignment properties to optimize message-passing programs by code transformations
IFL'04 Proceedings of the 16th international conference on Implementation and Application of Functional Languages
Hi-index | 0.00 |
This paper proposes a Barrier Tree for Meshes (BTM) to minimize the barrier synchronization latency for two-dimensional (2D) meshes. The proposed BTM scheme has two distinguishing features. First, the synchronization tree is 4-ary. The synchronization latency of the BTM scheme is asymptotically $\Theta (\log_{4} n)$, while that of the fastest scheme reported in the literature is bounded between $\Omega (\log_{3} n)$ and $O (n^{1/2})$, where $n$ is the number of member nodes. Second, nonmember nodes are neither involved in the construction of a BTM nor actively participate in the synchronization operations, which avoids interference among different process groups during synchronization. This not only results in low setup overhead, but also reduces the synchronization latency. The low setup overhead is particularly effective for the dynamic process model provided in MPI-2. Extensive simulation study shows that, for up to $64 \times 64$ meshes, the BTM scheme results in about $40 \sim 70$ percent shorter synchronization latency and is more scalable than conventional schemes.