Algorithm-Based Fault Tolerance for Matrix Operations

Authors:
Kuang-Hua Huang;J. A. Abraham
Affiliations:
Engineering Research Center, AT&T Technologies, Inc.;-
Venue:
IEEE Transactions on Computers
Year:
1984

Citing 8
Cited 107

Fault-tolerant algorithms for multiple processor systems

Fault-tolerant algorithms for multiple processor systems
Error Correction by Alternate-Data Retry

IEEE Transactions on Computers
Design of Self-Checking MOS-LSI Circuits: Application to a Four-Bit Microprocessor

IEEE Transactions on Computers
Watchdog Processors and Structural Integrity Checking

IEEE Transactions on Computers
Concurrent Error Detection in ALU's by Recomputing with Shifted Operands

IEEE Transactions on Computers
Design of a Massively Parallel Processor

IEEE Transactions on Computers
Fault Detection Capabilities of Alternating Logic

IEEE Transactions on Computers
Optimal rectangular code for high density magnetic tapes

IBM Journal of Research and Development

Bounds on Algorithm-Based Fault Tolerance in Multiple Processor Systems

IEEE Transactions on Computers - The MIT Press scientific computation series
Fault-Detection by Result-Checking for the Eigenproblem

EDCC-3 Proceedings of the Third European Dependable Computing Conference on Dependable Computing
A Comparison Study of the Behavior of Equivalent Algorithms in Fault Injection Experiments in Parallel Superscalar Architectures

SAFECOMP '01 Proceedings of the 20th International Conference on Computer Safety, Reliability and Security
Fault-Tolerant High-Performance Matrix Multiplication: Theory and Practice

DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
Experimental Evaluation of the Fail-Silent Behavior of a Distributed Real-Time Run-Time Support Built from COTS Components

DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
Low Cost Concurrent Test Implementation for Linear Digital Systems

ETW '00 Proceedings of the IEEE European Test Workshop
Cost analysis of a new algorithmic-based soft-error tolerant architecture

DFT '95 Proceedings of the IEEE International Workshop on Defect and Fault Tolerance in VLSI Systems
Fault tolerant matrix operations using checksum and reverse computation

FRONTIERS '96 Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation
Experimental evaluation of the fail-silent behaviour in programs with consistency checks

FTCS '96 Proceedings of the Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS '96)
Compiler-assisted generation of error-detecting parallel programs

FTCS '96 Proceedings of the Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS '96)
Fault Tolerant Matrix Operations for Networks of Workstations Using Multiple Checkpointing

HPC-ASIA '97 Proceedings of the High-Performance Computing on the Information Superhighway, HPC-Asia '97
Low-cost DC built-in self-test of linear analog circuits using checksums

VLSID '96 Proceedings of the 9th International Conference on VLSI Design: VLSI in Mobile Communication
Optimal Design of Checksum-Based Checkers for Fault Detection in Linear Analog Circuits

VLSID '97 Proceedings of the Tenth International Conference on VLSI Design: VLSI in Multimedia Applications
Software Development Kit for Dependable Applications in Embedded

ITC '00 Proceedings of the 2000 IEEE International Test Conference
Predicting Device Performance From Pass/Fail Transient Signal Analysis Data

ITC '00 Proceedings of the 2000 IEEE International Test Conference
Analytical Redundancy Based Approach for Concurrent Fault Detection in Linear Digital Systems

IOLTW '00 Proceedings of the 6th IEEE International On-Line Testing Workshop (IOLTW)
New Techniques for Accelerating Fault Injection in VHDL Descriptions

IOLTW '00 Proceedings of the 6th IEEE International On-Line Testing Workshop (IOLTW)
Evaluating the Effectiveness of a Software Fault-Tolerance Technique on RISC- and CISC-Based Architectures

IOLTW '00 Proceedings of the 6th IEEE International On-Line Testing Workshop (IOLTW)
Method for designing and placing check sets based on control flow analysis of programs

ISSRE '96 Proceedings of the Seventh International Symposium on Software Reliability Engineering
Algorithm-Based Diskless Checkpointing for Fault-Tolerant Matrix Operations

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Feasibility and Effectiveness of the Algorithm for Overhead Reduction in Analog Checkers

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Checking the Integrity of Trees

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
On-Line Error Monitoring for Several Data Structures

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
On-Line Fault Detection In DSP Circuits Using Extrapolated Checksums with Minimal Test Points

ITC '99 Proceedings of the 1999 IEEE International Test Conference
NetSolve/D: A Massively Parallel Grid Execution System for Scalable Data Intensive Collaboration

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 10 - Volume 11
A New Hybrid Fault Detection Technique for Systems-on-a-Chip

IEEE Transactions on Computers
Software-Based Adaptive and Concurrent Self-Testing in Programmable Network Interfaces

ICPADS '06 Proceedings of the 12th International Conference on Parallel and Distributed Systems - Volume 1
An optimized hybrid approach to provide fault detection and correction in SoCs

Proceedings of the 20th annual conference on Integrated circuits and systems design
Exact Fault-Sensitive Feasibility Analysis of Real-Time Tasks

IEEE Transactions on Computers
Software-Based Failure Detection and Recovery in Programmable Network Interfaces

IEEE Transactions on Parallel and Distributed Systems
Soft error vulnerability of iterative linear algebra methods

Proceedings of the 22nd annual international conference on Supercomputing
Globally optimized robust systems to overcome scaled CMOS reliability challenges

Proceedings of the conference on Design, automation and test in Europe
Algorithm-based fault tolerance applied to high performance computing

Journal of Parallel and Distributed Computing
The Optimal Architecture Design of Two-Dimension Matrix Multiplication Jumping Systolic Array

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences
Sequential element design with built-in soft error resilience

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
An algorithm based mesh check-sum fault tolerant scheme for stream ciphers

International Journal of Communication Networks and Distributed Systems
Fault Tolerance in Petascale/Exascale Systems: Current Knowledge, Challenges and Research Opportunities

International Journal of High Performance Computing Applications
A mesh check-sum ABFT scheme for stream ciphers

International Journal of Communication Networks and Distributed Systems
Fault Tolerant External Memory Algorithms

WADS '09 Proceedings of the 11th International Symposium on Algorithms and Data Structures
AN-Encoding Compiler: Building Safety-Critical Systems with Commodity Hardware

SAFECOMP '09 Proceedings of the 28th International Conference on Computer Safety, Reliability, and Security
Toward Exascale Resilience

International Journal of High Performance Computing Applications
Optimal real number codes for fault tolerant matrix operations

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Nonconcurrent error correction in the presence of roundoff noise

IEEE Transactions on Circuits and Systems Part I: Regular Papers
Mapping matrix multiplication algorithm onto fault-tolerant systolic array

Computers & Mathematics with Applications
Counting in the Presence of Memory Faults

ISAAC '09 Proceedings of the 20th International Symposium on Algorithms and Computation
Concurrent Error Detection in Multiplexer-Based Multipliers for Normal Basis of GF(2^m) Using Double Parity Prediction Scheme

Journal of Signal Processing Systems
Checksum-based probabilistic transient-error compensation for linear digital systems

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Constructing numerically stable real number codes using evolutionary computation

Proceedings of the 12th annual conference on Genetic and evolutionary computation
Output-sensitive decoding for redundant residue systems

Proceedings of the 2010 International Symposium on Symbolic and Algebraic Computation
Spread-spectrum computation

HotDep'08 Proceedings of the Fourth conference on Hot topics in system dependability
Efficient soft error-tolerant adaptive equalizers

IEEE Transactions on Circuits and Systems Part I: Regular Papers
Design techniques for cross-layer resilience

Proceedings of the Conference on Design, Automation and Test in Europe
Cross-layer resilience challenges: metrics and optimization

Proceedings of the Conference on Design, Automation and Test in Europe
ERSA: error resilient system architecture for probabilistic applications

Proceedings of the Conference on Design, Automation and Test in Europe
Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Algorithm-based recovery for HPL

Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Concurrent error detection in bit-serial normal basis multiplication over GF(2^m) using multiple parity prediction schemes

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
High performance linpack benchmark: a fault tolerant implementation without checkpointing

Proceedings of the international conference on Supercomputing
SRC: soft error detection and recovery for high performance linpack

Proceedings of the international conference on Supercomputing
Algorithm-based recovery for iterative methods without checkpointing

Proceedings of the 20th international symposium on High performance distributed computing
Tolerating correlated failures for generalized Cartesian distributions via bipartite matching

Proceedings of the 8th ACM International Conference on Computing Frontiers
Exploring the Limitations of Software-based Techniques in SEE Fault Coverage

Journal of Electronic Testing: Theory and Applications
A log-scaling fault tolerant agreement algorithm for a fault tolerant MPI

EuroMPI'11 Proceedings of the 18th European MPI Users' Group conference on Recent advances in the message passing interface
Run-through stabilization: an MPI proposal for process fault tolerance

EuroMPI'11 Proceedings of the 18th European MPI Users' Group conference on Recent advances in the message passing interface
Analyzing fault aware collective performance in a process fault tolerant MPI

Parallel Computing
Experimental study of resilient algorithms and data structures

SEA'10 Proceedings of the 9th international conference on Experimental Algorithms
Robust distributed orthogonalization based on randomized aggregation

Proceedings of the second workshop on Scalable algorithms for large-scale systems
Soft error resilient QR factorization for hybrid system with GPGPU

Proceedings of the second workshop on Scalable algorithms for large-scale systems
Fault tolerant matrix-matrix multiplication: correcting soft errors on-line

Proceedings of the second workshop on Scalable algorithms for large-scale systems
Algorithm-based fault tolerance for dense matrix factorizations

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Memory space conscious loop iteration duplication for reliable execution

SAS'05 Proceedings of the 12th international conference on Static Analysis
Resilient algorithms and data structures

CIAC'10 Proceedings of the 7th international conference on Algorithms and Complexity
Higher dependability and security for mobile applications

SPC'06 Proceedings of the Third international conference on Security in Pervasive Computing
Operating system support to detect application hangs

VECoS'08 Proceedings of the Second international conference on Verification and Evaluation of Computer and Communication Systems
On software design for stochastic processors

Proceedings of the 49th Annual Design Automation Conference
Cooperative Application/OS DRAM fault recovery

Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
A tunable, software-based DRAM error detection and correction library for HPC

Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
Periodic and non-concurrent error detection and identification in one-hot encoded FSMs

Automatica (Journal of IFAC)
Error detection and correction in switched linear controllers via periodic and non-concurrent checks

Automatica (Journal of IFAC)
A class of fault-tolerant systolic arrays for matrix multiplication

Mathematical and Computer Modelling: An International Journal
Fault tolerant preconditioned conjugate gradient for sparse linear system solution

Proceedings of the 26th ACM international conference on Supercomputing
Data-driven fault tolerance for work stealing computations

Proceedings of the 26th ACM international conference on Supercomputing
Fault resilience of the algebraic multi-grid solver

Proceedings of the 26th ACM international conference on Supercomputing
Evaluating operating system vulnerability to memory errors

Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers
Time-Constraint-Aware Optimization of Assertions in Embedded Software

Journal of Electronic Testing: Theory and Applications
Classifying soft error vulnerabilities in extreme-scale scientific applications using a binary instrumentation tool

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Software encoded processing: building dependable systems with commodity hardware

SAFECOMP'07 Proceedings of the 26th international conference on Computer Safety, Reliability, and Security
Reconfigurable Fault Tolerance: A Comprehensive Framework for Reliable and Adaptive FPGA-Based Space Computing

ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Priority queues resilient to memory faults

WADS'07 Proceedings of the 10th international conference on Algorithms and Data Structures
Convergence analysis of evolutionary algorithms in the presence of crash-faults and cheaters

Computers & Mathematics with Applications
A checkpoint-on-failure protocol for algorithm-based recovery in standard MPI

Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
An evaluation of user-level failure mitigation support in MPI

EuroMPI'12 Proceedings of the 19th European conference on Recent Advances in the Message Passing Interface
User level failure mitigation in MPI

Euro-Par'12 Proceedings of the 18th international conference on Parallel processing workshops
Concurrent and comparative fault simulation in SystemC and its application in robustness evaluation

Microprocessors & Microsystems
Correcting soft errors online in LU factorization

Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
Neutron sensitivity and software hardening strategies for matrix multiplication and FFT on graphics processing units

Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale
When is multi-version checkpointing needed?

Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale
Parallel reduction to hessenberg form with algorithm-based fault tolerance

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Multi-criteria checkpointing strategies: response-time versus resource utilization

Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
CPU-GPU hybrid bidiagonal reduction with soft error resilience

ScalA '13 Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
A study of application-level recovery methods for transient network faults

ScalA '13 Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
Self-stabilizing iterative solvers

ScalA '13 Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
A dual process redundancy approach to transient fault tolerance for ccNUMA architecture

Neurocomputing
Automated Algorithmic Error Resilience for Structured Grid Problems Based on Outlier Detection

Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
Detecting silent data corruption through data dynamic monitoring for scientific applications

Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
An evaluation of User-Level Failure Mitigation support in MPI

Computing
X10-FT: Transparent fault tolerance for APGAS language and runtime

Parallel Computing

Quantified Score

Hi-index	14.99

Abstract

The rapid progress in VLSI technology has reduced the cost of hardware, allowing multiple copies of low-cost processors to provide a large amount of computational capability for a small cost. In addition to achieving high performance, high reliability is also important to ensure that the results of long computations are valid. This paper proposes a novel system-level method of achieving high reliability, called algorithm-based fault tolerance. The technique encodes data at a high level, and algorithms are designed to operate on encoded data and produce encoded output data. The computation tasks within an algorithm are appropriately distributed among multiple computation units for fault tolerance. The technique is applied to matrix computations, which form the heart of many computation-intensive tasks. Algorithm-based fault tolerance schemes are proposed to detect and correct errors when matrix operations such as addition, multiplication, scalar product, LU-decomposition, and transposition are performed using multiple processor systems. The method proposed can detect and correct any failure within a single processor in a multiple processor system. The number of processors needed to just detect errors in matrix multiplication is also studied.
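
The abstract does not spell out the encoding, so the following is only a minimal sketch of one common reading of "operate on encoded data and produce encoded output data": checksum-augmented matrix multiplication. It assumes integer matrices (so checksum comparisons are exact), a single corrupted data element, and a single-process simulation rather than a real multiprocessor. All function names are hypothetical and chosen for illustration.

```python
# Illustrative sketch only: names are hypothetical, matrices are integer-valued,
# and the "fault" is injected by hand instead of coming from a failed processor.

def column_checksum(a):
    """Return a copy of a with one extra row holding each column's sum."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]

def row_checksum(b):
    """Return a copy of b with one extra column holding each row's sum."""
    return [row[:] + [sum(row)] for row in b]

def matmul(x, y):
    """Plain matrix product of two lists of lists."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)]
            for row in x]

def locate_and_correct(c):
    """Check a full-checksum matrix in place.

    Returns None if every data row and column matches its checksum, or the
    (row, column) index of the single data element that was corrected.
    Raises ValueError for any other mismatch pattern (for example, a
    corrupted checksum element), which this sketch detects but does not fix.
    """
    n, m = len(c) - 1, len(c[0]) - 1  # size of the data part
    bad_rows = [i for i in range(n) if sum(c[i][:m]) != c[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c[i][j] for i in range(n)) != c[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # Rebuild the element from the row checksum and the other row entries.
        c[i][j] = c[i][m] - (sum(c[i][:m]) - c[i][j])
        return (i, j)
    raise ValueError("mismatch pattern is not a single data-element error")

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

# Encoded inputs produce an encoded output: the product carries both checksums.
full = matmul(column_checksum(a), row_checksum(b))

faulty = [row[:] for row in full]
faulty[1][0] = 99                      # simulate one erroneous result element
fixed_at = locate_and_correct(faulty)  # inconsistent row and column intersect here
```

The design point the sketch tries to show is that the checksums are carried through the multiplication itself, so the check runs on the output alone and needs no second copy of the computation. With floating-point data the equality tests would need a tolerance, which is outside the scope of this sketch.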