Detection and correction of silent data corruption for large-scale high-performance computing

Authors:
David Fiala;Frank Mueller;Christian Engelmann;Rolf Riesen;Kurt Ferreira;Ron Brightwell
Affiliations:
North Carolina State University, Raleigh, NC;North Carolina State University, Raleigh, NC;Oak Ridge Natl Lab, Oak Ridge, TN;IBM Ireland, Dublin, Ireland;Sandia Natl Labs, Albuquerque, NM;Sandia Natl Labs, Albuquerque, NM
Venue:
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2012

Abstract

Faults have become the norm rather than the exception for high-end computing clusters. Exacerbating this situation, some of these faults remain undetected, manifesting themselves as silent errors that allow applications to compute incorrect results. This paper studies the potential for redundancy to detect and correct soft errors in MPI message-passing applications, and investigates the challenges inherent in detecting such errors when providing transparent MPI redundancy. Assuming a model in which corruption of application data manifests itself as differing MPI messages between replicas, we study the best-suited protocols for detecting and correcting corrupted MPI messages. Using our fault injector, we observe that even a single error can have profound effects on an application by causing a cascading pattern of corruption that in most cases spreads to all other processes. Results indicate that our consistency protocols can successfully protect applications experiencing even high rates of silent data corruption.
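
The model described in the abstract, in which corruption shows up as differing MPI messages between replicas, can be illustrated with a small sketch. This is not the paper's protocol: the function names `digest` and `vote`, the use of SHA-256, and the strict-majority rule are all assumptions chosen for illustration, and no real MPI calls are made. The idea shown is only that comparing the payloads each replica would send allows a mismatch to be detected with two replicas and outvoted with three.

```python
import hashlib
from collections import Counter


def digest(payload: bytes) -> str:
    # Hypothetical helper: compare replicas by hash rather than full payload.
    return hashlib.sha256(payload).hexdigest()


def vote(replica_payloads):
    """Compare the message each replica intends to send.

    Returns (agreed_payload, suspect_indices). When no strict majority
    exists, agreed_payload is None and every replica is a suspect:
    the corruption is detected but cannot be corrected.
    """
    digests = [digest(p) for p in replica_payloads]
    winner, count = Counter(digests).most_common(1)[0]
    if count * 2 <= len(digests):
        # No strict majority (e.g. two replicas that disagree).
        return None, list(range(len(digests)))
    agreed = replica_payloads[digests.index(winner)]
    suspects = [i for i, d in enumerate(digests) if d != winner]
    return agreed, suspects


# Triple redundancy: one corrupted replica is outvoted.
agreed, suspects = vote([b"abc", b"abX", b"abc"])

# Dual redundancy: a mismatch is detected but not corrected.
agreed2, suspects2 = vote([b"abc", b"abX"])
```

Under these assumptions, two replicas suffice for detection while correction needs at least three, since a strict majority must agree on the payload.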