Failure Detection and Membership Management in Grid Environments

Authors:
Amit Jain; R. K. Shyamasundar
Affiliations:
Tata Institute of Fundamental Research, India; Tata Institute of Fundamental Research, India
Venue:
GRID '04 Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing
Year:
2004


Abstract

Failure detectors are an integral part of any fault-tolerant distributed system and have therefore been studied extensively. However, previously proposed failure detectors perform poorly when applied to Grid environments. Most were designed either for local area networks or for a small number of nodes, and hence fall short in scalability, efficiency, and running time. In this paper we propose a highly scalable failure detector protocol that is aided by a membership management service. The membership management service is essential to make the failure detector transparent to changes in the system. By using a distributed heartbeat mechanism for an unreliable failure detector, we overcome the shortcomings of similar schemes proposed earlier. The protocol achieves scalability by reducing context-switching requirements, and it detects failures faster. The membership management protocol handles membership issues with a worst-case complexity of O(n), where n is the number of heartbeat groups. Note that n is much smaller than the total number of nodes in the Grid. The algorithm is also shown to be failure resilient and scalable.
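
To make the ideas in the abstract concrete, here is a minimal Python sketch of a heartbeat-based unreliable failure detector organized into heartbeat groups. It is not the authors' protocol, which the abstract does not specify. The class HeartbeatGroup, the function find_group, the timeout parameter, and the explicit now timestamps are all hypothetical names chosen for illustration. The sketch shows two things: a member that stays silent longer than the timeout is suspected (and the suspicion is withdrawn if a late heartbeat arrives, which is what makes the detector unreliable), and a membership lookup that scans the groups costs O(n) in the number of groups rather than in the number of nodes.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatGroup:
    """Hypothetical heartbeat group: a small set of nodes monitored together."""
    name: str
    timeout: float  # assumed parameter: seconds of silence before suspicion
    last_seen: dict = field(default_factory=dict)  # node id -> last heartbeat time

    def join(self, node: str, now: float) -> None:
        # A joining node counts as alive at the moment it joins.
        self.last_seen[node] = now

    def leave(self, node: str) -> None:
        self.last_seen.pop(node, None)

    def heartbeat(self, node: str, now: float) -> None:
        # Heartbeats from non-members are ignored.
        if node in self.last_seen:
            self.last_seen[node] = now

    def suspects(self, now: float) -> set:
        # Unreliable detection: silence past the timeout only raises suspicion,
        # and a later heartbeat clears it again.
        return {n for n, t in self.last_seen.items() if now - t > self.timeout}


def find_group(groups: list, node: str):
    """Membership lookup by scanning the groups: O(n) in the number of groups,
    assuming constant-time membership tests inside each group."""
    for group in groups:
        if node in group.last_seen:
            return group
    return None


# Usage with explicit timestamps, so the example is deterministic.
groups = [HeartbeatGroup("g0", timeout=3.0), HeartbeatGroup("g1", timeout=3.0)]
groups[0].join("a", now=0.0)
groups[0].join("b", now=0.0)
groups[1].join("c", now=0.0)

groups[0].heartbeat("a", now=2.0)
groups[0].heartbeat("a", now=4.0)
suspected_at_4 = groups[0].suspects(now=4.0)  # "b" has been silent for 4 s

groups[0].heartbeat("b", now=5.0)             # late heartbeat from "b"
suspected_at_5 = groups[0].suspects(now=5.0)  # suspicion of "b" is withdrawn
```

Passing the time in explicitly, rather than reading a clock inside the class, keeps the sketch testable. A real deployment would also need message transport, group formation, and handling of group-leader failure, none of which are modeled here.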