A total failure occurs whenever all processes cooperatively executing a distributed task fail before the task completes. A frequent prerequisite for recovery from a total failure is identification of the last set (LAST) of processes to fail. Necessary and sufficient conditions are derived here for computing LAST from the local failure data of recovered processes. These conditions are then translated into procedures for deciding LAST membership, using either complete or incomplete failure data. The choice of failure data is itself dictated by two requirements: (1) it must be cheap to maintain, and (2) it must afford maximum fault tolerance, in the sense that the expected number of recoveries required to identify LAST is minimized.
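
To make the idea of deciding LAST from local failure data concrete, the following is a minimal sketch under stated assumptions, not the procedure derived in the paper. It assumes the complete-data case in a simplified form: every process keeps a set of the processes whose failure it observed before failing itself, and every failure is observed by all processes still running at that time. The function name `find_last`, the `mourned` mapping, and the string process identifiers are hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional, Set


def find_last(all_procs: Set[str], mourned: Dict[str, Set[str]]) -> Optional[Set[str]]:
    """Return LAST if the recovered processes' data determines it, else None.

    `mourned[p]` is the (hypothetical) local failure record of a recovered
    process p: the processes whose failure p observed before failing itself.
    Only recovered processes appear as keys.
    """
    recovered = set(mourned)
    # Any process recorded as failed by a recovered process failed earlier
    # than that process, so it cannot belong to LAST.
    known_earlier = set().union(*mourned.values())
    candidates = all_procs - known_earlier
    # An unrecovered candidate may have outlived every recovered process,
    # so LAST cannot be decided until it recovers too.
    if not candidates <= recovered:
        return None
    return candidates


procs = {"a", "b", "c"}
# a failed first, then b; c has not recovered yet, so LAST is undecided.
print(find_last(procs, {"a": set(), "b": {"a"}}))
# c alone recovers and had observed both other failures.
print(find_last(procs, {"c": {"a", "b"}}))
```

In this sketch a decision is possible as soon as every process not known to have failed earlier has itself recovered, which illustrates why the number of recoveries needed depends on which failure data each process maintains.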