Hardware- and Software-Based Collective Communication on the Quadrics Network

Authors:
Fabrizio Petrini; Salvador Coll; Eitan Frachtenberg; Adolfy Hoisie
Venue:
NCA '01 Proceedings of the IEEE International Symposium on Network Computing and Applications (NCA'01)
Year:
2001

Cited by (14):

The Quadrics Network: High-Performance Clustering Technology
IEEE Micro

Performance Evaluation of I/O Traffic and Placement of I/O Nodes on a High Performance Network
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium

Collective communication patterns on the Quadrics network
Performance analysis and grid computing

Scalable Hardware-Based Multicast Trees
Proceedings of the 2003 ACM/IEEE conference on Supercomputing

QsNetII: Defining High-Performance Network Design
IEEE Micro

Optimizing All-to-All Collective Communication by Exploiting Concurrency in Modern Networks
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing

Optimizing Strided Remote Memory Access Operations on the Quadrics QsNetII Network Interconnect
HPCASIA '05 Proceedings of the Eighth International Conference on High-Performance Computing in Asia-Pacific Region

MPI collective algorithm selection and quadtree encoding
Parallel Computing

NIC-based reduction algorithms for large-scale clusters
International Journal of High Performance Computing and Networking

Efficiency and scalability of barrier synchronization on NoC based many-core architectures
CASES '08 Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems

HPPNetSim: a parallel simulation of large-scale interconnection networks
SpringSim '09 Proceedings of the 2009 Spring Simulation Multiconference

Scalability of relaxed consistency models in NoC based multicore architectures
ACM SIGARCH Computer Architecture News

Optimised gather collectives on QsNetII
PVM/MPI'05 Proceedings of the 12th European PVM/MPI users' group conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface

Exploiting atomic operations for barrier on Cray XE/XK systems
EuroMPI'12 Proceedings of the 19th European conference on Recent Advances in the Message Passing Interface

Abstract

The efficient implementation of collective communication patterns in a parallel machine is a challenging design effort that requires the solution of many problems. In this paper we present an in-depth description of how the Quadrics network supports both hardware- and software-based collectives. We describe the main features of the two building blocks of this network: a network interface that can perform zero-copy user-level communication, and a wormhole-routing switch. We also focus on the routing and flow control algorithms, on deadlock avoidance, and on how the processing nodes are integrated in a global, virtual shared memory.

Experiments conducted on a 64-node AlphaServer cluster indicate that the time to complete the hardware-based barrier synchronization on the whole network is as low as 6 microseconds, with very good scalability. Good latency and scalability are also achieved with the software-based synchronization, which takes about 15 microseconds. For the broadcast, the hardware- and software-based implementations achieve similar performance: both can deliver messages of up to 256 bytes in 13 microseconds and sustain an asymptotic bandwidth of 288 Mbytes/sec on all the nodes.

The hardware-based barrier is almost insensitive to network congestion, with 93% of the synchronizations taking less than 20 microseconds when the network is flooded with background traffic of unicast messages. The software-based implementation, on the other hand, suffers a significant performance degradation. Under high load the hardware broadcast maintains a reasonably good latency, delivering messages of up to 2 KB in 200 microseconds, while the software broadcast suffers from slightly higher latencies inherited from the synchronization mechanism. Both broadcast algorithms experience a significant degradation of the sustained bandwidth with large messages.