Assessing Fault Sensitivity in MPI Applications

Authors:
Charng-da Lu;Daniel A. Reed
Affiliations:
University of Illinois at Urbana-Champaign;University of North Carolina
Venue:
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Year:
2004

Abstract

Today, clusters built from commodity PCs dominate high-performance computing, with systems containing thousands of processors now being deployed. As node counts for multi-teraflop systems grow into the thousands, and with proposed petaflop systems likely to contain tens of thousands of nodes, the standard assumption that system hardware and software are fully reliable becomes much less credible. Concomitantly, understanding application sensitivity to system failures is critical to establishing confidence in the outputs of large-scale applications. Using software fault injection, we simulated single-bit memory errors, register file upsets, and MPI message payload corruption, and measured the behavioral responses of a suite of MPI applications. These experiments showed that most applications are very sensitive to even single errors. Perhaps most worrisome, the errors often went undetected, yielding erroneous output with no indication to the user. Encouragingly, even minimal internal application error checking and program assertions can detect some of the faults we injected.
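The idea of a single-bit fault, and of a minimal check that can expose it, can be illustrated with a small sketch. This is not the authors' injector, which targeted the memory, registers, and MPI messages of running processes; it only flips one randomly chosen bit in a byte buffer standing in for a message payload. The function names (`flip_bit`, `inject_single_bit_error`) and the use of a CRC-32 as the "minimal check" are assumptions made for illustration.

```python
import random
import struct
import zlib


def flip_bit(buf: bytearray, bit_index: int) -> None:
    """Flip exactly one bit of buf in place (hypothetical helper)."""
    byte_pos, bit_pos = divmod(bit_index, 8)
    buf[byte_pos] ^= 1 << bit_pos


def inject_single_bit_error(buf: bytearray, rng: random.Random) -> int:
    """Flip one uniformly chosen bit and return its index (hypothetical helper)."""
    bit_index = rng.randrange(len(buf) * 8)
    flip_bit(buf, bit_index)
    return bit_index


if __name__ == "__main__":
    # A stand-in "message payload": four little-endian doubles.
    values = (1.0, 2.0, 3.0, 4.0)
    payload = bytearray(struct.pack("<4d", *values))

    # Assumed minimal check: a CRC-32 computed before the fault is injected.
    crc_before = zlib.crc32(payload)

    flipped = inject_single_bit_error(payload, random.Random())
    corrupted = struct.unpack("<4d", payload)

    print("flipped bit index:", flipped)
    print("values after fault:", corrupted)
    # Without the check, the corrupted values would be consumed silently.
    print("checksum mismatch detected:", zlib.crc32(payload) != crc_before)
```

Depending on which bit is flipped, the decoded value may change negligibly (a low mantissa bit) or by many orders of magnitude (an exponent bit), which is one way a single error can silently alter output. A CRC detects every single-bit change in the data it covers, so the comparison in the last line reports a mismatch for any injected bit.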