As the number of nodes in a cluster grows, so does the probability of failures and unusual events. In this paper, we present checksum mechanisms to detect data corruption. We study the impact of checksums on network communication performance and propose a mechanism to amortize their cost on InfiniBand. We have implemented our mechanisms in the NewMadeleine communication library. Our evaluation shows that our message-integrity mechanisms do not noticeably impact application performance, which is an improvement over state-of-the-art MPI implementations.
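To make the idea of checksum-based corruption detection concrete, here is a minimal sketch of a sender attaching a checksum to a message and a receiver verifying it. It does not reproduce the mechanisms implemented in NewMadeleine: the choice of CRC32, the 4-byte header layout, and the function names `pack_message`, `unpack_message`, and `flip_bit` are all assumptions made for illustration.

```python
import struct
import zlib

# Hypothetical framing: a 4-byte big-endian CRC32 placed before the payload.
HEADER = struct.Struct("!I")


def pack_message(payload: bytes) -> bytes:
    """Sender side: prefix the payload with its CRC32."""
    return HEADER.pack(zlib.crc32(payload)) + payload


def unpack_message(frame: bytes) -> bytes:
    """Receiver side: return the payload, or raise ValueError on corruption."""
    if len(frame) < HEADER.size:
        raise ValueError("frame too short to contain a checksum")
    (expected,) = HEADER.unpack_from(frame)
    payload = frame[HEADER.size:]
    if zlib.crc32(payload) != expected:
        raise ValueError("checksum mismatch: message corrupted in transit")
    return payload


def flip_bit(frame: bytes, bit_index: int) -> bytes:
    """Simulate a transmission error by flipping a single bit."""
    corrupted = bytearray(frame)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)
```

An intact frame round-trips through `unpack_message(pack_message(data))`, while a frame passed through `flip_bit` raises `ValueError`. The sketch computes the checksum in a separate pass over the payload on each side, which is the per-message cost that the abstract proposes to amortize.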