miNI: reducing network interface memory requirements with dynamic handle lookup
ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Cluster communication protocols for parallel-programming systems
ACM Transactions on Computer Systems (TOCS)
A study of application-level recovery methods for transient network faults
ScalA '13 Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
In this paper, we investigate how system area networks can deal with transient and permanent network failures. We design and implement a firmware-level retransmission scheme to tolerate transient failures and an on-demand network mapping scheme to deal with permanent failures. Both schemes are transparent to applications, conceptually simple, and suitable for low-level implementations, e.g., in firmware. We then examine how the retransmission scheme affects system performance and how various protocol parameters impact system behavior. We analyze and evaluate system performance using a real implementation on a state-of-the-art cluster, with both micro-benchmarks and real applications from the SPLASH-2 suite.