Experimental evaluation of the fail-silent behaviour in programs with consistency checks

Authors:
M. Z. Rela;H. Madeira;J. G. Silva
Venue:
FTCS '96 Proceedings of the Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS '96)
Year:
1996


Abstract

An important research question is whether a non-duplicated computer can be made fail-silent, since that behaviour is assumed a priori in many algorithms. However, previous research has shown that in systems using simple behaviour-based error detection mechanisms invisible to the programmer (e.g. memory protection), the percentage of fail-silent violations can exceed 10%. Since the study of these errors has shown that they were mostly caused by pure data errors, we evaluate the effectiveness of software techniques capable of checking the semantics of the data, such as assertions, in detecting these remaining errors. The results of injecting physical pin-level faults show that these tests can prevent about 40% of the fail-silent model violations that escape the simple hardware-based error detection techniques. To decouple the intrinsic limitations of the tests used from other factors that might affect their error detection capabilities, we evaluated a special class of software checks known for its high theoretical coverage: algorithm-based fault tolerance (ABFT). Analysis of the remaining errors showed that most of them went undetected because of short-range control flow errors. When very simple software-based control flow checking was combined with the semantic tests, the target system, without any dedicated error detection hardware, behaved according to the fail-silent model for about 98% of all the faults injected.
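
As an illustration of the ABFT idea mentioned in the abstract, the following minimal sketch shows a checksum-protected matrix multiplication in the style of the classic row/column checksum scheme. It is not the authors' code or experimental setup: the names `checked_matmul` and `ChecksumMismatch` and the `fault` parameter (a simulated data corruption) are assumptions made for this example, and integer matrices are used so that checksums can be compared exactly (floating-point data would need a tolerance).

```python
class ChecksumMismatch(Exception):
    """Raised instead of delivering a result that fails its checksum test."""


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def add_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def add_row_checksum(b):
    """Append one extra column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def checked_matmul(a, b, fault=None):
    """Multiply a and b with checksum rows/columns and verify the result.

    fault is a hypothetical (row, col, delta) tuple used only to simulate
    a data error in the computed product for demonstration purposes.
    """
    n, m = len(a), len(b[0])
    full = matmul(add_column_checksum(a), add_row_checksum(b))

    if fault is not None:  # simulated corruption, not real fault injection
        i, j, delta = fault
        full[i][j] += delta

    # Each data row must sum to the entry in the checksum column.
    for i in range(n):
        if sum(full[i][:m]) != full[i][m]:
            raise ChecksumMismatch(f"row {i} checksum failed")
    # Each data column must sum to the entry in the checksum row.
    for j in range(m):
        if sum(full[i][j] for i in range(n)) != full[n][j]:
            raise ChecksumMismatch(f"column {j} checksum failed")

    # Strip the checksum row and column before returning the product.
    return [row[:m] for row in full[:n]]
```

In this sketch, raising an exception rather than returning a value stands in for fail-silent behaviour: a result that fails its consistency check is never delivered to the caller. A corrupted data element violates a row checksum, while an error-free run returns the ordinary product.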