In this paper, we propose a decentralized group membership service that can be incorporated into existing grid middleware to make it more reliable. This service includes a flexible failure detector that adapts dynamically to changing network conditions and can be configured with a number of failure recovery strategies. Moreover, it disseminates information about membership changes (new processes, failures, etc.) in a scalable and efficient manner. We conducted a preliminary evaluation of the proposed service by simulating a grid with up to 140 nodes distributed across three domains separated by a wide-area network. This evaluation showed that the proposed service performs well both in the absence and in the presence of process failures.
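The abstract does not say how the failure detector adapts to network conditions. Purely as an illustration, the sketch below assumes a heartbeat-based detector whose timeout is recomputed from a sliding window of recent heartbeat inter-arrival times: the mean interval plus `k` standard deviations. The class `AdaptiveHeartbeatDetector` and its parameters (`window`, `k`, `min_timeout`, `initial_timeout`) are hypothetical. This is one common adaptive-timeout technique, not necessarily the detector used in the paper.

```python
from collections import deque
from statistics import mean, pstdev


class AdaptiveHeartbeatDetector:
    """Hypothetical sketch (not the paper's design): suspect a monitored
    process when its next heartbeat is overdue by more than an adaptive
    timeout of mean(inter-arrival) + k * stddev(inter-arrival)."""

    def __init__(self, window: int = 100, k: float = 4.0,
                 min_timeout: float = 0.5, initial_timeout: float = 5.0):
        self.intervals = deque(maxlen=window)   # recent inter-arrival times (s)
        self.k = k                              # safety margin, in standard deviations
        self.min_timeout = min_timeout          # lower bound on the timeout (s)
        self.initial_timeout = initial_timeout  # used before any interval is observed
        self.last_arrival = None                # time of the most recent heartbeat

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat received at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def timeout(self) -> float:
        """Current timeout; it grows when heartbeats become slower or more jittery."""
        if not self.intervals:
            return self.initial_timeout
        mu = mean(self.intervals)
        sigma = pstdev(self.intervals)
        return max(self.min_timeout, mu + self.k * sigma)

    def suspects(self, now: float) -> bool:
        """True if the process is suspected at time `now`.
        A process that has never sent a heartbeat is not suspected here."""
        if self.last_arrival is None:
            return False
        return now - self.last_arrival > self.timeout()


# Usage: a steady 1 s heartbeat yields a 1 s timeout, while heartbeats with the
# same mean period but alternating 0.5 s / 1.5 s gaps yield a 3 s timeout.
steady = AdaptiveHeartbeatDetector()
for t in range(11):
    steady.heartbeat(float(t))

jittery = AdaptiveHeartbeatDetector()
t = 0.0
for i in range(11):
    jittery.heartbeat(t)
    t += 0.5 if i % 2 == 0 else 1.5

print(steady.timeout(), jittery.timeout())
```

The design choice illustrated here is that a fixed timeout must be tuned for the worst link in a wide-area deployment, whereas an adaptive one tolerates jitter on slow links without delaying detection on fast ones. A recovery strategy, as mentioned in the abstract, would then act on the `suspects` result.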