Collective communication patterns on the Quadrics network

  • Authors:
  • Salvador Coll;José Duato;Francisco J. Mora;Fabrizio Petrini;Adolfy Hoisie

  • Affiliations:
  • Department of Electronic Engineering, Technical University of Valencia, Valencia, Spain;Department of Electronic Engineering, Technical University of Valencia, Valencia, Spain;Department of Electronic Engineering, Technical University of Valencia, Valencia, Spain;Performance and Architecture Laboratory, CCS-3, Los Alamos National Laboratory, Los Alamos, NM;Performance and Architecture Laboratory, CCS-3, Los Alamos National Laboratory, Los Alamos, NM

  • Venue:
  • Performance analysis and grid computing
  • Year:
  • 2004

Abstract

The efficient implementation of collective communication is a key factor in achieving good performance and scalability for communication patterns that involve global data movement and global control. It is also essential for enhancing the fault tolerance of a parallel computer, for instance when checking the status of the nodes, running a distributed load-balancing algorithm, synchronizing the local clocks, or monitoring performance. Support for multicast communication can therefore improve the performance and resource utilization of a parallel computer. The Quadrics interconnect (QsNET), which is used in some of the largest machines in the world, provides hardware support for multicast. The basic mechanism allows a message to be sent to any set of contiguous nodes in the same time it takes to send a unicast message. The two main collective communication primitives provided by the network software are barrier synchronization and broadcast. Each is implemented in two ways: using the hardware support when the nodes are contiguous, or using a balanced tree and unicast messaging otherwise. This paper presents performance results for these collective communication services. They show, on the one hand, the outstanding performance of the hardware-based primitives even in the presence of high network background traffic and, on the other hand, the limited performance achieved with the software-based implementation.
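
The choice between the two broadcast implementations described in the abstract can be illustrated with a small sketch. This is not the Quadrics software or its API: the function names, the binary fan-out, and the heap-style tree layout are all assumptions made for illustration. The sketch only shows the idea that a contiguous node set needs a single hardware multicast, while a non-contiguous set falls back to unicast messages arranged in a balanced tree.

```python
def is_contiguous(node_ids):
    """Return True if the node IDs form one unbroken integer range."""
    ids = sorted(set(node_ids))
    return bool(ids) and ids[-1] - ids[0] + 1 == len(ids)


def tree_children(rank, size, fanout=2):
    """Children of `rank` in a balanced tree over ranks 0..size-1.

    Heap-style numbering is an assumption of this sketch.
    """
    first = rank * fanout + 1
    return [c for c in range(first, first + fanout) if c < size]


def broadcast_plan(node_ids, fanout=2):
    """Return (mode, levels) for a broadcast rooted at the lowest node ID.

    mode is "hardware" or "software_tree" (hypothetical labels).
    For "hardware", levels holds one entry: the (first, last) node range.
    For "software_tree", each level lists (sender, receiver) unicast pairs.
    """
    ids = sorted(set(node_ids))
    if not ids:
        raise ValueError("node set must not be empty")
    if is_contiguous(ids):
        # One multicast covers the whole contiguous range.
        return "hardware", [[(ids[0], ids[-1])]]

    levels = []
    frontier = [0]  # ranks that already hold the message
    while frontier:
        sends = [(r, c) for r in frontier
                 for c in tree_children(r, len(ids), fanout)]
        if not sends:
            break
        # Map tree ranks back to the actual node IDs.
        levels.append([(ids[r], ids[c]) for r, c in sends])
        frontier = [c for _, c in sends]
    return "software_tree", levels
```

For example, `broadcast_plan([4, 5, 6, 7])` selects the hardware path with a single multicast, whereas `broadcast_plan([0, 1, 3, 4, 6])` produces a two-level tree of unicast messages. The number of tree levels grows with the logarithm of the node count, which is one plausible reason a software-based implementation would scale less well than the hardware mechanism.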