This paper presents a class of count-and-threshold mechanisms, collectively named $\alpha$-count, that discriminate between transient and intermittent faults in computing systems. Commercial systems have used threshold-based techniques to discriminate transient faults for many years. We aim to make count-and-threshold schemes more useful by exploring their effects on the system. We adopt a mathematically defined structure that is simple enough to analyze with standard tools. $\alpha$-count has internal parameters that can be tuned to suit environmental variables (such as the transient fault rate and intermittent fault occurrence patterns). We carried out an extensive behavior analysis of two versions of the count-and-threshold scheme, assuming first exponentially distributed fault occurrences and then more realistic fault patterns.
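The abstract does not give the update rule, so the following is only a minimal sketch of one possible count-and-threshold scheme, not the paper's definition. It assumes a score that grows by 1 on each error signal, is multiplied by a decay factor on each error-free observation, and flags the component once the score reaches a threshold. The names `AlphaCount`, `decay`, `threshold`, and `first_flag` are hypothetical, and the parameter values are chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class AlphaCount:
    """Illustrative count-and-threshold filter (assumed update rule)."""
    decay: float = 0.5      # assumed decay factor, 0 <= decay < 1
    threshold: float = 3.0  # assumed threshold on the score
    score: float = 0.0      # current accumulated score

    def observe(self, error: bool) -> bool:
        """Update the score with one observation; return True if flagged."""
        if error:
            self.score += 1.0         # count the error signal
        else:
            self.score *= self.decay  # forget gradually on success
        return self.score >= self.threshold


def first_flag(signals: Iterable[int], decay: float = 0.5,
               threshold: float = 3.0) -> Optional[int]:
    """Return the index of the first observation that crosses the
    threshold, or None if the threshold is never reached."""
    counter = AlphaCount(decay=decay, threshold=threshold)
    for index, signal in enumerate(signals):
        if counter.observe(bool(signal)):
            return index
    return None


# Isolated errors separated by error-free runs (transient-like pattern).
sparse = [1, 0, 0, 0] * 10
# Errors arriving back to back (intermittent-like pattern).
burst = [0, 0, 1, 1, 1, 0]

print(first_flag(sparse))  # the score decays between errors
print(first_flag(burst))   # the score accumulates across the burst
```

Under these assumed parameters, widely spaced errors decay away before they can accumulate, while closely spaced errors push the score over the threshold. Tuning `decay` and `threshold` trades detection delay against the risk of flagging a healthy component hit by transients, which is the kind of parameter tuning the abstract refers to.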