Characterizing the impact of soft errors on iterative methods in scientific computing

Authors:
Manu Shantharam;Sowmyalatha Srinivasmurthy;Padma Raghavan
Affiliations:
The Pennsylvania State University, University Park, PA, USA;The Pennsylvania State University, University Park, PA, USA;The Pennsylvania State University, University Park, PA, USA
Venue:
Proceedings of the international conference on Supercomputing
Year:
2011

Abstract

The increase in on-chip transistor count facilitates higher performance, but at the expense of higher susceptibility to soft errors. In this paper, we characterize the challenges posed by soft errors for large-scale applications representative of workloads on supercomputing systems. Such applications are typically based on the computational solution of partial differential equation models using either explicit or implicit methods. In both cases, the execution time is typically dominated by the time spent in the underlying sparse matrix vector multiplication kernel (SpMV, t ← A • y). We provide a theoretical analysis of the impact of a single soft error through its propagation by a sequence of sparse matrix vector multiplication operations. Our analysis indicates that a single soft error in some ith component of the vector y can corrupt the entire resultant vector in a relatively short sequence of SpMV operations. Additionally, the propagation pattern corresponds to the sparsity structure of the coefficient matrix A, and the magnitude of the error grows nonlinearly as $(\|A_{i*}\|_2)^k$ after k SpMV operations, where $\|A_{i*}\|_2$ is the 2-norm of the ith row of A. We corroborate this analysis with empirical observations on a model heat equation using an explicit method, and on well-known sparse matrix systems (matrices from a test suite) for the implicit method using iterative solvers such as CG, PCG and SOR. Our results indicate that explicit schemes will suffer from soft-error-induced numerical instabilities, exacerbating the intrinsic stability issues of such methods, which impose constraints on relative time and space step sizes. For implicit schemes, linear solver performance with the widely used CG and PCG schemes degrades by a factor as high as 200x, whereas a stationary scheme such as SOR is inherently soft error resilient.
Our results thus indicate the need for new approaches to achieve soft error resiliency in such methods, and for a critical evaluation of the tradeoffs among multiple metrics, including performance, reliability and energy.
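
The propagation claim in the abstract can be illustrated with a minimal sketch. It is not the authors' experimental setup: the tridiagonal matrix (a 1D explicit heat-equation stencil with coefficient r = 0.25), the vector size, the perturbation size, and the helper names `spmv`, `tridiagonal` and `corrupted_counts` are all assumptions made for illustration. The sketch perturbs one component of y, applies repeated SpMV operations to a clean and a faulty copy, and counts how many components differ after each step.

```python
def spmv(rows, y):
    """Compute t = A * y with A stored as a list of sparse rows [(col, value), ...]."""
    return [sum(v * y[j] for j, v in row) for row in rows]


def tridiagonal(n, r):
    """Hypothetical test matrix: 1D explicit heat-equation stencil (r, 1 - 2r, r)."""
    rows = []
    for i in range(n):
        row = []
        if i > 0:
            row.append((i - 1, r))
        row.append((i, 1.0 - 2.0 * r))
        if i < n - 1:
            row.append((i + 1, r))
        rows.append(row)
    return rows


def corrupted_counts(rows, y, i, delta, steps):
    """Perturb y[i] by delta (a stand-in for a soft error), then count the
    components that differ from the clean run after each SpMV."""
    clean = list(y)
    faulty = list(y)
    faulty[i] += delta
    counts = []
    for _ in range(steps):
        clean = spmv(rows, clean)
        faulty = spmv(rows, faulty)
        counts.append(sum(1 for c, f in zip(clean, faulty) if c != f))
    return counts


A = tridiagonal(9, 0.25)          # assumed size and coefficient
y = [1.0] * 9
counts = corrupted_counts(A, y, i=4, delta=1.0, steps=4)
print(counts)  # [3, 5, 7, 9]
```

In this toy case the corrupted region widens by one neighbor on each side per SpMV, following the tridiagonal sparsity pattern, and covers all 9 components after 4 operations. The sketch only tracks which components are affected; it does not attempt to reproduce the error-magnitude growth analyzed in the paper.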