We describe a replication-based protocol that uses group communication to provide fault tolerance in the Computational Grid. The Grid is partitioned into a number of clusters, each with a designated coordinator that manages the state of the replicas within its cluster. The coordinators form a process group, and the protocol ensures that the coordinators deliver messages to the replicas in the correct order. Any failed node of the Grid is replaced by an active replica so that the application continues to operate correctly. We present the theoretical framework, illustrate the replication protocol, report implementation results, and analyze the protocol's performance and scalability.
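
The abstract does not spell out the protocol's mechanics, so the following is only a minimal sketch of the general idea under stated assumptions: a single cluster whose coordinator stamps each message with a sequence number, replicas that deliver strictly in that order, and a failed node replaced by a live replica. The names `Replica`, `Coordinator`, `multicast`, and `handle_failure`, and the use of per-cluster sequence numbers, are hypothetical choices made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical replica that delivers messages strictly in sequence order."""
    name: str
    next_seq: int = 0                             # next sequence number to deliver
    pending: dict = field(default_factory=dict)   # seq -> message, held until deliverable
    log: list = field(default_factory=list)       # messages delivered so far
    alive: bool = True

    def receive(self, seq, msg):
        if seq < self.next_seq:
            return  # duplicate of an already delivered message
        self.pending[seq] = msg
        # Deliver every buffered message that is now contiguous with the log.
        while self.next_seq in self.pending:
            self.log.append(self.pending.pop(self.next_seq))
            self.next_seq += 1


class Coordinator:
    """Hypothetical cluster coordinator: orders messages and handles failover."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.primary = self.replicas[0]  # assumption: the first replica starts as the active node
        self.seq = 0

    def multicast(self, msg):
        # Assumption: ordering is enforced by a coordinator-assigned sequence number.
        seq = self.seq
        self.seq += 1
        for replica in self.replicas:
            if replica.alive:
                replica.receive(seq, msg)
        return seq

    def handle_failure(self, failed):
        failed.alive = False
        if failed is self.primary:
            survivors = [r for r in self.replicas if r.alive]
            if not survivors:
                raise RuntimeError("no live replica left in the cluster")
            # Promote the surviving replica that has delivered the most messages.
            self.primary = max(survivors, key=lambda r: r.next_seq)
        return self.primary
```

In this sketch, because every live replica receives the same sequence-numbered stream, whichever replica is promoted after a failure already holds the same delivered history as the node it replaces. Coordination among the coordinators of different clusters, which the abstract attributes to a process group, is deliberately left out.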