The increase in on-chip transistor count enables higher performance, but at the expense of higher susceptibility to soft errors. In this paper, we characterize the challenges posed by soft errors for large-scale applications representative of workloads on supercomputing systems. Such applications are typically based on the computational solution of partial differential equation models using either explicit or implicit methods. In both cases, the execution time is typically dominated by the time spent in the underlying sparse matrix-vector multiplication kernel (SpMV, $t \leftarrow A \cdot y$). We provide a theoretical analysis of the impact of a single soft error as it propagates through a sequence of SpMV operations. Our analysis indicates that a single soft error in the $i$th component of the vector $y$ can corrupt the entire resultant vector within a relatively short sequence of SpMV operations. Additionally, the propagation pattern corresponds to the sparsity structure of the coefficient matrix $A$, and the magnitude of the error grows non-linearly as $(\|A_{i*}\|_2)^k$ after $k$ SpMV operations, where $\|A_{i*}\|_2$ is the 2-norm of the $i$th row of $A$. We corroborate this analysis with empirical observations on a model heat equation using an explicit method, and on well-known sparse matrix systems (matrices from a test suite) for the implicit method using iterative solvers such as CG, PCG and SOR. Our results indicate that explicit schemes will suffer from soft-error-induced numerical instabilities, exacerbating the intrinsic stability issues of such methods, which impose constraints on the relative time and space step sizes. For implicit schemes, linear solver performance with the widely used CG and PCG schemes degrades by a factor as high as 200x, whereas a stationary scheme such as SOR is inherently soft error resilient.
Our results thus indicate the need for new approaches to achieve soft error resiliency in such methods, and for a critical evaluation of the tradeoffs among multiple metrics, including performance, reliability and energy.
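The propagation claim can be illustrated with a minimal sketch. Because SpMV is linear, the difference between a faulty and a fault-free run after $k$ products is $A^k e$, where $e$ is zero except for the injected error in component $i$. The code below tracks only that error vector and counts how many components it touches after each product. The tridiagonal matrix (a 1D explicit heat-equation stencil with ratio `r`), the problem size, and the function names `tridiagonal_spmv` and `error_support_sizes` are illustrative assumptions, not the setup used in the paper. With the stable choice `r = 0.25` the error spreads along the sparsity pattern but does not grow, so the sketch shows the propagation pattern only, not the magnitude growth.

```python
def tridiagonal_spmv(lower, diag, upper, y):
    """Compute t = A*y for a tridiagonal A with constant bands (assumed stencil)."""
    n = len(y)
    t = [0.0] * n
    for i in range(n):
        acc = diag * y[i]
        if i > 0:
            acc += lower * y[i - 1]
        if i < n - 1:
            acc += upper * y[i + 1]
        t[i] = acc
    return t


def error_support_sizes(n, fault_index, steps, r=0.25):
    """Count corrupted components after 0..steps SpMV operations.

    The error vector starts as a unit perturbation in one component and is
    propagated as e <- A*e, which by linearity equals the difference between
    the faulty and fault-free iterates.
    """
    error = [0.0] * n
    error[fault_index] = 1.0  # hypothetical single soft error in component i
    sizes = [sum(1 for e in error if e != 0.0)]
    for _ in range(steps):
        # Explicit heat-equation stencil: (r, 1 - 2r, r) on the three bands.
        error = tridiagonal_spmv(r, 1.0 - 2.0 * r, r, error)
        sizes.append(sum(1 for e in error if e != 0.0))
    return sizes


if __name__ == "__main__":
    # One fault in the middle of an 11-component vector, five SpMV steps.
    print(error_support_sizes(11, 5, 5))
```

In this sketch the corrupted region widens by one neighbor on each side per product, following the tridiagonal sparsity structure, until every component of the vector is affected.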