Soft error vulnerability of iterative linear algebra methods

Authors:
Greg Bronevetsky; Bronis de Supinski
Affiliations:
Lawrence Livermore National Laboratory, Livermore, CA, USA; Lawrence Livermore National Laboratory, Livermore, CA, USA
Venue:
Proceedings of the 22nd annual international conference on Supercomputing
Year:
2008


Abstract

Devices are increasingly vulnerable to soft errors as their feature sizes shrink. Previously, soft error rates were significant primarily in space and high-atmospheric computing. Modern architectures now use features so small, at sufficiently low voltages, that soft errors are becoming important even at terrestrial altitudes. Because of their large number of components, supercomputers are particularly susceptible to soft errors. Since many large-scale parallel scientific applications use iterative linear algebra methods, the soft error vulnerability of these methods constitutes a large fraction of the applications' overall vulnerability. Many users consider these methods invulnerable to most soft errors since they converge from an imprecise solution to a precise one. However, we show in this paper that iterative methods are vulnerable to soft errors, exhibiting both silent data corruptions and a poor ability to detect errors. Further, we evaluate a variety of soft error detection and tolerance techniques, including checkpointing, linear matrix encodings, and residual tracking techniques.
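Two of the detection ideas named in the abstract, linear matrix encodings and residual tracking, can be illustrated with a minimal sketch. This is not the authors' implementation. The fault model (a single bit flip in one vector entry), the Jacobi solver, the tolerances, and all function names below are assumptions chosen only for illustration.

```python
import struct

def flip_bit(x, bit):
    """Flip one bit (0 = lowest mantissa bit, 63 = sign) of a double."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped

def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def column_checksum(A):
    # Linear encoding: c = 1^T A, so sum(A x) should equal c . x up to rounding.
    return [sum(col) for col in zip(*A)]

def checked_matvec(A, c, x, tol=1e-9, corrupt=None):
    """Compute y = A x and test it against the checksum c.

    corrupt=(i, bit) is a hypothetical fault: flip `bit` of y[i] before the check.
    """
    y = matvec(A, x)
    if corrupt is not None:
        i, bit = corrupt
        y[i] = flip_bit(y[i], bit)
    expected = sum(ci * xi for ci, xi in zip(c, x))
    # A NaN on the left makes the comparison False, so NaNs are flagged too.
    ok = abs(sum(y) - expected) <= tol * max(1.0, abs(expected))
    return y, ok

def jacobi_tracked(A, b, iters=50, growth=10.0, floor=1e-12, fault=None):
    """Jacobi iteration that flags a sudden jump in the residual norm.

    fault=(k, i, bit) is a hypothetical fault: flip `bit` of x[i] at iteration k.
    Returns (x, flagged_iteration), where flagged_iteration is None if no jump was seen.
    """
    n = len(b)
    x = [0.0] * n
    prev = None
    for k in range(iters):
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
        if fault is not None and fault[0] == k:
            _, i, bit = fault
            x[i] = flip_bit(x[i], bit)
        norm = max(abs(bi - yi) for bi, yi in zip(b, matvec(A, x)))
        # `floor` keeps rounding noise near convergence from raising false alarms.
        if prev is not None and not (norm <= growth * max(prev, floor)):
            return x, k
        prev = norm
    return x, None
```

On a small diagonally dominant system, a flip in a high exponent bit trips both checks, while a flip in the lowest mantissa bit passes both. Checks of this kind are therefore tolerance-based: they catch large deviations and let small perturbations through.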