A simple and general design uses message-based communication to provide software tolerance of single-point hardware failures. Every interprocess message is delivered to inactive backups of both the sender and the destination, which keeps both backups in a state from which they can take over for their primaries. An implementation for the Auragen 4000 series of M68000-based systems is described. The operating system, Auros™, is a distributed version of UNIX. Major goals have been transparency of fault tolerance and efficient execution in the absence of failure.
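The delivery rule described above can be illustrated with a minimal sketch. It is not the Auros implementation: the class names, the `deliver` and `take_over` functions, and the choice to have the destination's backup save messages while the sender's backup only counts them are all assumptions made for illustration, since the abstract does not say what each backup does with the messages it receives.

```python
from dataclasses import dataclass, field


@dataclass
class Backup:
    """Inactive backup (hypothetical): records traffic but does not execute."""
    saved_inbox: list = field(default_factory=list)  # messages its primary received
    sent_count: int = 0  # number of messages its primary has sent


@dataclass
class Process:
    """A primary process paired with one inactive backup."""
    name: str
    inbox: list = field(default_factory=list)
    backup: Backup = field(default_factory=Backup)


def deliver(sender: Process, dest: Process, payload: str) -> None:
    """Deliver one message to the destination and to both backups."""
    dest.inbox.append(payload)               # 1. destination primary
    dest.backup.saved_inbox.append(payload)  # 2. destination's backup (assumed to save it)
    sender.backup.sent_count += 1            # 3. sender's backup (assumed to count it)


def take_over(failed: Process) -> Process:
    """Promote the backup of a failed primary (assumed recovery step).

    The new primary starts with every message the old primary received,
    so it can re-process them and catch up.
    """
    return Process(name=failed.name + "-recovered",
                   inbox=list(failed.backup.saved_inbox))


if __name__ == "__main__":
    a, b = Process("a"), Process("b")
    deliver(a, b, "request-1")
    deliver(a, b, "request-2")
    recovered = take_over(b)
    print(recovered.inbox, a.backup.sent_count)
```

In this sketch the sender's backup keeps only a count. One plausible use, again an assumption, is to let a recovering backup skip re-sending messages its primary had already sent before failing.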