The efficient implementation of collective communication is key to providing good performance and scalability for communication patterns that involve global data movement and global control. It is also essential to enhancing the fault tolerance of a parallel computer, for instance when checking the status of the nodes, running a distributed load-balancing algorithm, synchronizing the local clocks, or monitoring performance. Support for multicast communication can therefore improve both the performance and the resource utilization of a parallel computer. The Quadrics interconnect (QsNET), which is used in some of the largest machines in the world, provides hardware support for multicast. The basic mechanism allows a message to be sent to any set of contiguous nodes in the same time it takes to send a unicast message. The two main collective communication primitives provided by the network software are barrier synchronization and broadcast. Each is implemented in two ways: with the hardware support when the nodes are contiguous, and with a balanced tree and unicast messaging otherwise. This paper presents performance results for these collective communication services. They show, on the one hand, the outstanding performance of the hardware-based primitives even under heavy background network traffic and, on the other hand, the limited performance achieved by the software-based implementation.
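The choice between the two implementations can be illustrated with a minimal sketch. It is not the Quadrics network software: the function names, the use of integer node IDs, the default fan-out of 2, and the array-style tree layout are all assumptions made for illustration. The sketch only shows the rule stated above, where a contiguous node set takes the hardware multicast path and any other set falls back to a balanced tree of unicast messages.

```python
from typing import Dict, List, Sequence


def is_contiguous(nodes: Sequence[int]) -> bool:
    """True when the node IDs form one unbroken range, e.g. 4, 5, 6, 7."""
    ordered = sorted(set(nodes))
    return bool(ordered) and ordered[-1] - ordered[0] + 1 == len(ordered)


def balanced_tree(nodes: Sequence[int], fanout: int = 2) -> Dict[int, List[int]]:
    """Map each node to the nodes it forwards to (hypothetical layout).

    The nodes are placed in a complete k-ary tree in list order,
    so nodes[0] is the root of the broadcast.
    """
    order = list(nodes)
    tree: Dict[int, List[int]] = {}
    for i, node in enumerate(order):
        first_child = fanout * i + 1
        tree[node] = order[first_child:first_child + fanout]
    return tree


def tree_depth(count: int, fanout: int = 2) -> int:
    """Number of tree levels below the root for `count` nodes."""
    depth, index = 0, count - 1
    while index > 0:
        index = (index - 1) // fanout  # move to the parent position
        depth += 1
    return depth


def plan_broadcast(nodes: Sequence[int], fanout: int = 2) -> dict:
    """Pick the hardware path for contiguous nodes, the tree path otherwise."""
    if is_contiguous(nodes):
        # Assumption: one multicast reaches every node in the range.
        return {"method": "hardware", "tree": None, "depth": 0}
    return {
        "method": "software",
        "tree": balanced_tree(nodes, fanout),
        "depth": tree_depth(len(nodes), fanout),
    }


if __name__ == "__main__":
    print(plan_broadcast([4, 5, 6, 7]))
    print(plan_broadcast([0, 1, 2, 5, 9]))
```

In the software case the message crosses a number of tree levels that grows with the logarithm of the node count, which is consistent with the abstract's observation that the tree-based implementation performs worse than the hardware multicast.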