Totem: a fault-tolerant multicast group communication system
Communications of the ACM
Horus: a flexible group communication system
Communications of the ACM
A Metaobject Architecture for Fault-Tolerant Distributed Systems: The FRIENDS Approach
IEEE Transactions on Computers
Chameleon: A Software Infrastructure for Adaptive Fault Tolerance
IEEE Transactions on Parallel and Distributed Systems
GUARDS: A Generic Upgradable Architecture for Real-Time Dependable Systems
IEEE Transactions on Parallel and Distributed Systems
Reliable Distributed Computing with the ISIS Toolkit
Reliable Distributed Computing with the ISIS Toolkit
The Möbius Framework and Its Implementation
IEEE Transactions on Software Engineering
CoCheck: Checkpointing and Process Migration for MPI
IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
Providing QoS Customization in Distributed Object Systems
Middleware '01 Proceedings of the IFIP/ACM International Conference on Distributed Systems Platforms Heidelberg
DSN '00 Proceedings of the 2000 International Conference on Dependable Systems and Networks (formerly FTCS-30 and DCCA-8)
State Synchronization and Recovery for Strongly Consistent Replicated CORBA Objects
DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
Fault-Tolerant High-Performance Matrix Multiplication: Theory and Practice
DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
An Experimental Evaluation of the REE SIFT Environment for Spaceborne Applications
DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
Experimental Evaluation of a COTS System for Space Application
DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
CCGRID '01 Proceedings of the 1st International Symposium on Cluster Computing and the Grid
A Fault Tolerance Framework for CORBA
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
AQuA: An Adaptive Architecture that Provides Dependable Distributed Objects
SRDS '98 Proceedings of the The 17th IEEE Symposium on Reliable Distributed Systems
Micro-Checkpointing: Checkpointing for Multithreaded Applications
IOLTW '00 Proceedings of the 6th IEEE International On-Line Testing Workshop (IOLTW)
A Framework for Assessing Dependability in Distributed Systems with Lightweight Fault Injectors
IPDS '00 Proceedings of the 4th International Computer Performance and Dependability Symposium
The ensemble system
Hierarchical error detection in a software implemented fault tolerance (sift) environment
Hierarchical error detection in a software implemented fault tolerance (sift) environment
A process architecture and runtime environment for dependable distributed applications
A process architecture and runtime environment for dependable distributed applications
A system model for dynamically reconfigurable software
IBM Systems Journal
Application Fault Tolerance with Armor Middleware
IEEE Internet Computing
ACM SIGBED Review - Special issue: The second workshop on high performance, fault adaptive, large scale embedded real-time systems (FALSE-II)
Scheduling Tradeoffs for Heterogeneous Computing on an Advanced Space Processing Platform
ICPADS '06 Proceedings of the 12th International Conference on Parallel and Distributed Systems - Volume 2
Exploring recovery from operating system lockups
ATC'07 2007 USENIX Annual Technical Conference on Proceedings of the USENIX Annual Technical Conference
Cluster Computing
Fault injection-based assessment of partial fault tolerance in stream processing applications
Proceedings of the 5th ACM international conference on Distributed event-based system
An architectural framework for detecting process hangs/crashes
EDCC'05 Proceedings of the 5th European conference on Dependable Computing
Hi-index | 0.00 |
Few distributed software-implemented fault tolerance (SIFT) environments have been experimentally evaluated using substantial applications to show that they protect both themselves and the applications from errors. This paper presents an experimental evaluation of a SIFT environment used to oversee spaceborne applications as part of the Remote Exploration and Experimentation (REE) program at the Jet Propulsion Laboratory. The SIFT environment is built around a set of self-checking ARMOR processes running on different machines that provide error detection and recovery services to themselves and to the REE applications. An evaluation methodology is presented in which over 28,000 errors were injected into both the SIFT processes and two representative REE applications. The experiments were split into three groups of error injections, with each group successively stressing the SIFT error detection and recovery more than the previous group. The results show that the SIFT environment added negligible overhead to the application's execution time during failure-free runs. Correlated failures affecting a SIFT process and application process are possible, but the division of detection and recovery responsibilities in the SIFT environment allows it to recover from these multiple failure scenarios. Only 28 cases were observed in which either the application failed to start or the SIFT environment failed to recognize that the application had completed. Further investigations showed that assertions within the SIFT processes驴coupled with object-based incremental checkpointing驴were effective in preventing system failures by protecting dynamic data within the SIFT processes.