Determining the last process to fail

Authors: Dale Skeen
Affiliations: Cornell University and IBM Research Laboratory, San Jose, CA
Venue: ACM Transactions on Computer Systems (TOCS)
Year: 1985

Abstract

A total failure occurs whenever all processes cooperatively executing a distributed task fail before the task completes. A frequent prerequisite for recovery from a total failure is identification of the last set (LAST) of processes to fail. Necessary and sufficient conditions are derived here for computing LAST from the local failure data of recovered processes. These conditions are then translated into procedures for deciding LAST membership, using either complete or incomplete failure data. The choice of failure data is itself dictated by two requirements: (1) it must be cheap to maintain, and (2) it must afford maximum fault-tolerance, in the sense that the expected number of recoveries required for identifying LAST is minimized.
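
The idea of deciding LAST membership from the local failure data of recovered processes can be illustrated with a small sketch. The abstract does not spell out the paper's procedures, so the following is only one plausible framing under stated assumptions, not the author's algorithm: each process is assumed to log the set of processes it saw fail before it failed itself, failure detection is assumed to be reliable, and no process is assumed to rejoin before the total failure. The function name `last_to_fail` and the log format `failed_before` are hypothetical.

```python
from typing import Dict, Optional, Set


def last_to_fail(
    all_procs: Set[str],
    failed_before: Dict[str, Set[str]],
) -> Optional[Set[str]]:
    """Return LAST if the recovered processes' logs determine it, else None.

    failed_before maps each *recovered* process to the set of processes it
    observed failing before its own failure (an assumed log format).
    """
    recovered = set(failed_before)

    # Any process that some recovered process saw fail earlier cannot be in LAST.
    known_earlier = set().union(*failed_before.values())
    candidates = all_procs - known_earlier

    # LAST is a subset of the candidates. Under the stated assumptions it is
    # pinned down once every remaining candidate has itself recovered.
    if candidates <= recovered:
        return candidates
    return None  # more processes must recover before LAST can be identified


# Example: a fails first, then b, then c.
procs = {"a", "b", "c"}
only_a = last_to_fail(procs, {"a": set()})
a_and_b = last_to_fail(procs, {"a": set(), "b": {"a"}})
only_c = last_to_fail(procs, {"c": {"a", "b"}})
```

In this example the recovery of `a` alone, or of `a` and `b` together, leaves LAST undetermined, while the recovery of `c` alone is enough to identify LAST as `{"c"}`. This matches the abstract's point that the number of recoveries needed depends on which failure data each process keeps.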