Enhancing Replica Management Services to Cope with Group Failures

Authors:
Paul D. Ezhilchelvan;Santosh K. Shrivastava
Venue:
Advances in Distributed Systems, Advanced Distributed Computing: From Algorithms to Systems
Year:
1999

Abstract

In a distributed system, replicating components such as objects is a well-known way of achieving availability. For increased availability, crashed and disconnected components must be replaced by new components on available spare nodes. This replacement causes the membership of the replicated group to 'walk' over a number of machines during system operation. In this context, we address the problem of reconfiguring a group after the group as an entity has failed. Such a failure is termed a group failure; examples are the crash of every component in the group and the partitioning of the group into minority islands. The solution assumes crash-proof storage, eventual recovery of crashed nodes, and eventual healing of partitions. It guarantees that (i) at most one group is reconfigured after a group failure, and (ii) the reconfigured group contains a majority of the components that were members of the group just before the group failure occurred, so that the loss of state information due to a group failure is minimal. Though the protocol is subject to blocking, it remains efficient in terms of communication rounds and use of stable store, both during normal operation and during reconfiguration after a group failure.
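
The two guarantees in the abstract rest on a majority condition, which the following minimal Python sketch illustrates. It is not the protocol from the paper: the function names (`is_majority`, `may_reconfigure`) and the string node identifiers are hypothetical, and it assumes that every recovering node has recorded the same last membership view on stable storage. Under that assumption, any two majorities of the same view must share a member, so two disjoint islands can never both qualify to reconfigure the group.

```python
from typing import Iterable


def is_majority(candidates: Iterable[str], last_view: Iterable[str]) -> bool:
    """Return True if the candidates include a strict majority of last_view.

    Nodes that were not members of last_view are ignored when counting.
    """
    view = frozenset(last_view)
    present = frozenset(candidates) & view
    return 2 * len(present) > len(view)


def may_reconfigure(reachable: Iterable[str], last_view: Iterable[str]) -> bool:
    """Hypothetical check: an island may form the new group only if it
    holds a majority of the membership recorded just before the failure.
    Otherwise it must block until more nodes recover or partitions heal.
    """
    return is_majority(reachable, last_view)


if __name__ == "__main__":
    last_view = {"n1", "n2", "n3", "n4", "n5"}

    # Two disjoint islands after a partition: only one can hold a majority.
    island_a = {"n1", "n2", "n3"}
    island_b = {"n4", "n5"}
    print(may_reconfigure(island_a, last_view))
    print(may_reconfigure(island_b, last_view))
```

An island that fails the check has to wait, which corresponds to the blocking the abstract mentions: progress resumes only once enough former members have recovered or reconnected.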