Bounds on Algorithm-Based Fault Tolerance in Multiple Processor Systems
IEEE Transactions on Computers - The MIT Press scientific computation series
An analysis of algorithm-based fault tolerance techniques
Journal of Parallel and Distributed Computing
A Linear Algebraic Model of Algorithm-Based Fault Tolerance
IEEE Transactions on Computers
Algorithm-Based Fault Tolerance on a Hypercube Multiprocessor
IEEE Transactions on Computers
Fault-Tolerant High-Performance Matrix Multiplication: Theory and Practice
DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
Practical Issues in the Use of ABFT and a New Failure Model
FTCS '98 Proceedings of the The Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing
Time Redundancy Based Soft-Error Tolerance to Rescue Nanometer Technologies
VTS '99 Proceedings of the 1999 17TH IEEE VLSI Test Symposium
Fault-tolerant matrix operations for parallel and distributed systems
Fault-tolerant matrix operations for parallel and distributed systems
Condition Numbers of Gaussian Random Matrices
SIAM Journal on Matrix Analysis and Applications
Algorithm-Based Fault Tolerance for Matrix Operations
IEEE Transactions on Computers
Soft error vulnerability of iterative linear algebra methods
Proceedings of the 22nd annual international conference on Supercomputing
Algorithm-Based Fault Tolerance for Fail-Stop Failures
IEEE Transactions on Parallel and Distributed Systems
Optimal real number codes for fault tolerant matrix operations
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Shoestring: probabilistic soft error reliability on the cheap
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Constructing numerically stable real number codes using evolutionary computation
Proceedings of the 12th annual conference on Genetic and evolutionary computation
High performance linpack benchmark: a fault tolerant implementation without checkpointing
Proceedings of the international conference on Supercomputing
Matrix Multiplication on GPUs with On-Line Fault Tolerance
ISPA '11 Proceedings of the 2011 IEEE Ninth International Symposium on Parallel and Distributed Processing with Applications
High Performance Dense Linear System Solver with Soft Error Resilience
CLUSTER '11 Proceedings of the 2011 IEEE International Conference on Cluster Computing
Algorithm-based fault tolerance for dense matrix factorizations
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Hi-index | 0.00 |
In high-performance systems, the probability of failure is higher with more processors. Errors in calculations may occur that cannot be detected by outside means. To address this problem, we create a checksum-based approach that detects and recovers from calculation errors. We apply this approach to the LU factorization algorithm used by High Performance Linpack. Our approach has low overhead; in contrast to an existing approach that requires repeated calculation, it repeats only a fraction of the calculation during recovery. Because of error propagation, the existing approach has to repeat calculations when soft errors occur. Our approach detects and corrects errors during the calculation before they are propagated. The frequency of checking can be adjusted for the error rate, resulting in a flexible method of fault tolerance.