Today, clusters built from commodity PCs dominate high-performance computing, with systems containing thousands of processors now being deployed. As node counts for multi-teraflop systems grow to thousands, and with proposed petaflop systems likely to contain tens of thousands of nodes, the standard assumption that system hardware and software are fully reliable becomes much less credible. Concomitantly, understanding application sensitivity to system failures is critical to establishing confidence in the outputs of large-scale applications. Using software fault injection, we simulated single-bit memory errors, register file upsets, and MPI message payload corruption, and measured the behavioral responses of a suite of MPI applications. These experiments showed that most applications are very sensitive to even single errors. Perhaps most worrisome, the errors were often undetected, yielding erroneous output with no indication to the user. Encouragingly, even minimal internal application error checking and program assertions can detect some of the faults we injected.
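As a rough illustration of the kind of fault described above, the following sketch flips one randomly chosen bit in a byte buffer standing in for a message payload, then shows how a simple application-level checksum could flag the corruption. It is not the injector used in the study: the function names `flip_bit` and `inject_single_bit_error`, the sample values, and the choice of CRC-32 as the internal check are all assumptions made for this example.

```python
import random
import struct
import zlib


def flip_bit(buf: bytearray, bit_index: int) -> None:
    """Flip one bit in place; bit_index counts from bit 0 of byte 0."""
    byte, offset = divmod(bit_index, 8)
    buf[byte] ^= 1 << offset


def inject_single_bit_error(buf: bytearray, rng: random.Random) -> int:
    """Flip one uniformly chosen bit in buf and return its index (hypothetical helper)."""
    bit_index = rng.randrange(len(buf) * 8)
    flip_bit(buf, bit_index)
    return bit_index


# Hypothetical payload: a few doubles packed the way a message buffer might hold them.
values = [1.0, 2.5, -3.75, 1e6]
fmt = f"<{len(values)}d"
payload = bytearray(struct.pack(fmt, *values))

# Application-level check computed before the fault is injected (assumed, not from the study).
checksum = zlib.crc32(payload)

rng = random.Random(0)  # fixed seed so a run can be repeated
flipped = inject_single_bit_error(payload, rng)

# Without a check, the application would silently consume the corrupted values.
corrupted = struct.unpack(fmt, payload)

# With the check, the mismatch exposes the fault.
detected = zlib.crc32(payload) != checksum
print(f"flipped bit {flipped}, detected={detected}")
```

A CRC catches every single-bit flip in the data it covers, but only for data that passes through the check. Errors in registers or in memory that is never checksummed would need other mechanisms, such as assertions on values the application expects to stay within known bounds.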