A measurement-based model for workload dependence of CPU errors
IEEE Transactions on Computers - The MIT Press scientific computation series
Concurrent Error Detection Using Watchdog Processors-A Survey
IEEE Transactions on Computers
RIFLE: A General Purpose Pin-level Fault Injector
EDCC-1 Proceedings of the First European Dependable Computing Conference on Dependable Computing
Distributed Systems - Architecture and Implementation, An Advanced Course
Two Fault Injection Techniques for Test of Fault Handling Mechanisms
Proceedings of the IEEE International Test Conference on Test: Faster, Better, Sooner
Algorithm-Based Fault Tolerance for Matrix Operations
IEEE Transactions on Computers
Building dependable systems: how to keep up with complexity
FTCS'95 Proceedings of the Twenty-Fifth international conference on Fault-tolerant computing
IEEE Transactions on Computers
IEEE Transactions on Parallel and Distributed Systems
The Design and Verification of the Rio File Cache
IEEE Transactions on Computers
Fault-Detection by Result-Checking for the Eigenproblem
EDCC-3 Proceedings of the Third European Dependable Computing Conference on Dependable Computing
SAFECOMP '01 Proceedings of the 20th International Conference on Computer Safety, Reliability and Security
A Study of Failure Models in Feedback Control Systems
DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
Experimental assessment of parallel systems
FTCS '96 Proceedings of the The Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS '96)
Software Development Kit for Dependable Applications in Embedded
ITC '00 Proceedings of the 2000 IEEE International Test Conference
Predicting Device Performance From Pass/Fail Transient Signal Analysis Data
ITC '00 Proceedings of the 2000 IEEE International Test Conference
Fault Tolerance Techniques for the Merrimac Streaming Supercomputer
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Using loop invariants to fight soft errors in data caches
Proceedings of the 2005 Asia and South Pacific Design Automation Conference
Improving scratch-pad memory reliability through compiler-guided data block duplication
ICCAD '05 Proceedings of the 2005 IEEE/ACM International conference on Computer-aided design
Fault Tolerant External Memory Algorithms
WADS '09 Proceedings of the 11th International Symposium on Algorithms and Data Structures
Counting in the Presence of Memory Faults
ISAAC '09 Proceedings of the 20th International Symposium on Algorithms and Computation
Experiences with the design of a run-time check
SAFECOMP'06 Proceedings of the 25th international conference on Computer Safety, Reliability, and Security
On the effectiveness of run-time checks
SAFECOMP'05 Proceedings of the 24th international conference on Computer Safety, Reliability, and Security
Resilient algorithms and data structures
CIAC'10 Proceedings of the 7th international conference on Algorithms and Complexity
Time-Constraint-Aware Optimization of Assertions in Embedded Software
Journal of Electronic Testing: Theory and Applications
Priority queues resilient to memory faults
WADS'07 Proceedings of the 10th international conference on Algorithms and Data Structures
Comprehensive analysis of software countermeasures against fault attacks
Proceedings of the Conference on Design, Automation and Test in Europe
Hi-index | 0.01 |
An important research topic deals with the investigation of whether a non-duplicated computer can be made fail-silent, since that behaviour is a-priori assumed in many algorithms. However, previous research has shown that in systems using a simple behaviour based error detection mechanism invisible to the programmer (e.g. memory protection), the percentage of fail-silent violations could be higher than 10%. Since the study of these errors has shown that they were mostly caused by pure data errors, we evaluate the effectiveness of software techniques capable of checking the semantics of the data, such as assertions, to detect these remaining errors. The results of injecting physical pin-level faults show that these tests can prevent about 40% of the fail-silent model violations that escape the simple hardware-based error detection techniques. In order to decouple the intrinsic limitations of the tests used from other factors that might affect its error detection capabilities, we evaluated a special class of software checks known for its high theoretical coverage: algorithm based fault tolerance (ABFT). The analysis of the remaining errors showed that most of them remained undetected due to short range control flow errors. When very simple software-based control flow checking was associated to the semantic tests, the target system, without any dedicated error detection hardware, behaved according to the fail-silent model for about 98% of all the faults injected.