Performance Evaluation of I/O Traffic and Placement of I/O Nodes on a High Performance Network
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Collective communication patterns on the quadrics network
Performance analysis and grid computing
Scalable Hardware-Based Multicast Trees
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Optimizing All-to-All Collective Communication by Exploiting Concurrency in Modern Networks
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Optimizing Strided Remote Memory Access Operations on the Quadrics QsNetII Network Interconnect
HPCASIA '05 Proceedings of the Eighth International Conference on High-Performance Computing in Asia-Pacific Region
MPI collective algorithm selection and quadtree encoding
Parallel Computing
NIC-based reduction algorithms for large-scale clusters
International Journal of High Performance Computing and Networking
Efficiency and scalability of barrier synchronization on NoC based many-core architectures
CASES '08 Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems
HPPNetSim: a parallel simulation of large-scale interconnection networks
SpringSim '09 Proceedings of the 2009 Spring Simulation Multiconference
Scalability of relaxed consistency models in NoC based multicore architectures
ACM SIGARCH Computer Architecture News
Optimised gather collectives on QsNetII
PVM/MPI'05 Proceedings of the 12th European PVM/MPI users' group conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Exploiting atomic operations for barrier on cray XE/XK systems
EuroMPI'12 Proceedings of the 19th European conference on Recent Advances in the Message Passing Interface
The efficient implementation of collective communication patterns in a parallel machine is a challenging design effort that requires the solution of many problems. In this paper we present an in-depth description of how the Quadrics network supports both hardware- and software-based collectives. We describe the main features of the two building blocks of this network: a network interface that can perform zero-copy user-level communication, and a wormhole routing switch. We also focus on the routing and flow control algorithms, on deadlock avoidance, and on how the processing nodes are integrated in a global, virtual shared memory.

Experimental results obtained on a 64-node AlphaServer cluster indicate that the time to complete the hardware-based barrier synchronization on the whole network is as low as 6 microseconds, with very good scalability. Good latency and scalability are also achieved with the software-based synchronization, which takes about 15 microseconds. With the broadcast, similar performance is achieved by the hardware- and software-based implementations, which can deliver messages of up to 256 bytes in 13 microseconds and can sustain an asymptotic bandwidth of 288 Mbytes/sec on all the nodes.

The hardware-based barrier is almost insensitive to network congestion, with 93% of the synchronizations taking less than 20 microseconds when the network is flooded with background traffic of unicast messages. The software-based implementation, on the other hand, suffers a significant performance degradation. Under high load the hardware broadcast maintains a reasonably good latency, delivering messages of up to 2 KB in 200 microseconds, while the software broadcast suffers from slightly higher latencies inherited from the synchronization mechanism. Both broadcast algorithms experience a significant degradation of the sustained bandwidth with large messages.