Failure detectors are an integral part of any fault-tolerant distributed system and have therefore been studied extensively. However, previously proposed failure detectors perform poorly in Grid environments. Most were designed either for local area networks or for a small number of nodes, and so fall short in scalability, efficiency, and running time. In this paper we propose a highly scalable failure detector protocol that is aided by a membership management service. The membership management service is essential to make the failure detector transparent to changes in the system. By using a distributed heartbeat mechanism for an unreliable failure detector, we overcome the shortcomings of similar earlier schemes. The protocol achieves scalability by reducing context-switching requirements, and it detects failures faster. The membership management protocol handles membership issues with a worst-case complexity of O(n), where n is the number of heartbeat groups. Note that n is much smaller than the total number of nodes in the Grid. The algorithm is also shown to be failure resilient and scalable.
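To make the idea of group-based heartbeating concrete, here is a minimal sketch of a timeout-driven unreliable failure detector organized into heartbeat groups. It is not the protocol proposed in the paper, whose details the abstract does not give. The names `HeartbeatGroup`, `suspected_nodes`, and the `timeout` parameter are hypothetical, and time is passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatGroup:
    """Hypothetical heartbeat group: stores the last heartbeat time per member."""
    timeout: float  # seconds of silence after which a member is suspected
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, node: str, now: float) -> None:
        # Record that `node` was heard from at time `now`.
        self.last_seen[node] = now

    def suspects(self, now: float) -> set:
        # Unreliable detection: a slow but live node can be suspected here
        # and cleared again by its next heartbeat.
        return {n for n, t in self.last_seen.items() if now - t > self.timeout}


def suspected_nodes(groups: dict, now: float) -> set:
    # Visit each heartbeat group once and merge the per-group suspect sets.
    result = set()
    for group in groups.values():
        result |= group.suspects(now)
    return result


# Assumed example data: two groups with a 3-second timeout.
groups = {"g1": HeartbeatGroup(timeout=3.0), "g2": HeartbeatGroup(timeout=3.0)}
groups["g1"].heartbeat("a", now=0.0)
groups["g1"].heartbeat("b", now=4.0)
groups["g2"].heartbeat("c", now=4.5)
print(suspected_nodes(groups, now=5.0))  # prints {'a'}
```

In this sketch each node is monitored only within its own group, so the top-level pass iterates over groups rather than over every node in the system, which is the intuition behind a cost that depends on the number of heartbeat groups.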