Transient fault detection via simultaneous multithreading
Proceedings of the 27th annual international symposium on Computer architecture
Detailed design and evaluation of redundant multithreading alternatives
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Transient-fault recovery for chip multiprocessors
Proceedings of the 30th annual international symposium on Computer architecture
Terrestrial-Based Radiation Upsets: A Cautionary Tale
FCCM '05 Proceedings of the 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines
Failure trends in a large disk drive population
FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
CCGRID '08 Proceedings of the 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid
Soft error vulnerability of iterative linear algebra methods
Proceedings of the 22nd annual international conference on Supercomputing
DDMR: Dynamic and Scalable Dual Modular Redundancy with Short Validation Intervals
IEEE Computer Architecture Letters
PLR: A Software Approach to Transient Fault Tolerance for Multicore Architectures
IEEE Transactions on Dependable and Secure Computing
DRAM errors in the wild: a large-scale field study
Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
VolpexMPI: An MPI Library for Execution of Parallel Applications on Volatile Nodes
Proceedings of the 16th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Redundancy management technique for space shuttle computers
IBM Journal of Research and Development
Transparent redundant computing with MPI
EuroMPI'10 Proceedings of the 17th European MPI users' group meeting conference on Recent advances in the message passing interface
The International Exascale Software Project roadmap
International Journal of High Performance Computing Applications
Evaluating the viability of process replication reliability for exascale systems
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
System implications of memory reliability in exascale computing
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
File I/O for MPI Applications in Redundant Execution Scenarios
PDP '12 Proceedings of the 2012 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing
Combining Partial Redundancy and Checkpointing for HPC
ICDCS '12 Proceedings of the 2012 IEEE 32nd International Conference on Distributed Computing Systems
Replication for send-deterministic MPI HPC applications
Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale
When is multi-version checkpointing needed?
Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale
ACR: automatic checkpoint/restart for soft and hard error protection
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
SPBC: leveraging the characteristics of MPI HPC applications for scalable checkpointing
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Hi-index | 0.00 |
Faults have become the norm rather than the exception for high-end computing clusters. Exacerbating this situation, some of these faults remain undetected, manifesting themselves as silent errors that allow applications to compute incorrect results. This paper studies the potential for redundancy to detect and correct soft errors in MPI message-passing applications while investigating the challenges inherent to detecting soft errors within MPI applications by providing transparent MPI redundancy. By assuming a model wherein corruption in application data manifests itself by producing differing MPI messages between replicas, we study the best suited protocols for detecting and correcting corrupted MPI messages. Using our fault injector, we observe that even a single error can have profound effects on applications by causing a cascading pattern of corruption which in most cases spreads to all other processes. Results indicate that our consistency protocols can successfully protect applications experiencing even high rates of silent data corruption.