Fast parallel algorithms for short-range molecular dynamics. Journal of Computational Physics.
High performance messaging on workstations: Illinois Fast Messages (FM) for Myrinet. Supercomputing '95: Proceedings of the 1995 ACM/IEEE Conference on Supercomputing.
When the CRC and TCP checksum disagree. Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication.
End-to-end arguments in system design. ACM Transactions on Computer Systems (TOCS).
Computer Networks.
NAMD: biomolecular simulation on thousands of processors. Proceedings of the 2002 ACM/IEEE Conference on Supercomputing.
A network-failure-tolerant message-passing system for terascale clusters. International Journal of Parallel Programming.
Can memory-less network adapters benefit next-generation InfiniBand systems? HOTI '05: Proceedings of the 13th Symposium on High Performance Interconnects.
High performance MPI design using unreliable datagram for ultra-scale InfiniBand clusters. Proceedings of the 21st Annual International Conference on Supercomputing.
An overview of QoS capabilities in InfiniBand, Advanced Switching Interconnect, and Ethernet. IEEE Communications Magazine.
An important part of modern supercomputing platforms is the network interconnect. As the number of computing nodes in clusters has increased, the role of the interconnect has become more important. Modern interconnects such as InfiniBand, Quadrics, and Myrinet have become popular due to their low latency and higher performance than traditional Ethernet. As these interconnects become more widely used and clusters continue to scale, design choices such as where data reliability should be provided become increasingly important. In this work we address the issue of network reliability design, using InfiniBand as a case study. Unlike other high-performance interconnects, InfiniBand exposes both reliable and unreliable APIs. As part of our study we implement the Message Passing Interface (MPI) over the Unreliable Connection (UC) transport and compare it with MPI over the Reliable Connection (RC) and Unreliable Datagram (UD) transports. We detail the costs of reliability for different message patterns and show that providing reliability in software instead of hardware can improve performance by up to 25% for a molecular dynamics application (NAMD) on a 512-core InfiniBand cluster.
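To make the transport distinction concrete, below is a minimal sketch (not the paper's implementation) of creating an InfiniBand Unreliable Connection (UC) queue pair with the libibverbs API. A UC queue pair delivers messages in order over an established connection but the HCA neither acknowledges nor retransmits, so an MPI library built on it must add reliability (sequence numbers, acknowledgements, retransmission) in software; the queue and completion-queue sizes here are illustrative placeholders.

    /* Sketch: open the first HCA and create a UC queue pair with libibverbs. */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void) {
        int num;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) { fprintf(stderr, "no InfiniBand device found\n"); return 1; }

        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);
        struct ibv_cq *cq = ibv_create_cq(ctx, 256, NULL, NULL, 0);

        struct ibv_qp_init_attr attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .cap = { .max_send_wr = 128, .max_recv_wr = 128,
                     .max_send_sge = 1,  .max_recv_sge = 1 },
            /* IBV_QPT_UC: connected transport with no hardware acks or retries;
             * an MPI layered on top must detect and recover lost packets itself. */
            .qp_type = IBV_QPT_UC,
        };
        struct ibv_qp *qp = ibv_create_qp(pd, &attr);
        printf("created UC QP %u; reliability must be provided in software\n",
               qp ? qp->qp_num : 0);

        if (qp) ibv_destroy_qp(qp);
        ibv_destroy_cq(cq);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }

Changing qp_type to IBV_QPT_RC would instead request the Reliable Connection transport, where the HCA performs acknowledgement and retransmission in hardware; the paper's comparison is between these two placements of the reliability protocol.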