A replication-based fault tolerance protocol using group communication for the grid

Authors: Kayhan Erciyes
Affiliations: Computer Eng. Dept., Izmir Institute of Technology, Urla, Izmir, Turkey
Venue: ISPA'06 Proceedings of the 4th international conference on Parallel and Distributed Processing and Applications
Year: 2006

Abstract

We describe a replication-based protocol that uses group communication to provide fault tolerance in the Computational Grid. The Grid is partitioned into clusters, and each cluster has a designated coordinator that manages the states of the replicas within it. The coordinators form a process group, and the protocol ensures that they deliver messages to the replicas in the correct sequence. Any failing Grid node is replaced by an active replica so that the application continues to operate correctly. We present the theoretical framework, illustrate the replication protocol, report implementation results, and analyze the protocol's performance and scalability.
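
The abstract gives only the outline of the protocol, so the following is a minimal sketch of the general idea rather than the paper's algorithm. It assumes a single cluster whose coordinator stamps each message with a sequence number, replicas that deliver strictly in sequence order, and a failover rule that promotes the first surviving replica. The names `Replica`, `Coordinator`, `multicast`, and `fail` are hypothetical, and the ordering among coordinators in the process group is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    alive: bool = True
    next_seq: int = 0                            # next sequence number to deliver
    pending: dict = field(default_factory=dict)  # out-of-order messages, keyed by seq
    log: list = field(default_factory=list)      # messages delivered so far, in order

    def receive(self, seq, msg):
        """Buffer the message, then deliver every message that is next in sequence."""
        self.pending[seq] = msg
        while self.next_seq in self.pending:
            self.log.append(self.pending.pop(self.next_seq))
            self.next_seq += 1


class Coordinator:
    """Hypothetical cluster coordinator (an assumption, not the paper's design)."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.seq = 0                     # sequence number for the next message
        self.active = self.replicas[0]   # replica currently serving the application

    def multicast(self, msg):
        """Stamp the message and send it to every live replica in the cluster."""
        seq = self.seq
        self.seq += 1
        for replica in self.replicas:
            if replica.alive:
                replica.receive(seq, msg)

    def fail(self, replica):
        """Mark a replica as failed; promote a survivor if it was the active one."""
        replica.alive = False
        if replica is self.active:
            survivors = [r for r in self.replicas if r.alive]
            if not survivors:
                raise RuntimeError("no live replica left in the cluster")
            self.active = survivors[0]


# Small demonstration: the active replica fails between two messages.
a, b, c = Replica("a"), Replica("b"), Replica("c")
coord = Coordinator([a, b, c])
coord.multicast("m0")
coord.fail(a)
coord.multicast("m1")
```

Because every live replica has delivered the same prefix of messages in the same order, the promoted replica in this sketch can take over without replaying state. A real deployment would also need failure detection and agreement among coordinators, which the sketch leaves out.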