Gossip protocols provide a means of detecting failures asynchronously in large distributed systems, without the limits associated with reliable multicasting for group communications. To support application recovery and reconfiguration effectively, however, these protocols need mechanisms that detect failures with system-wide consensus in a scalable fashion. This paper presents three new gossip-style protocols, supported by a novel algorithm, to achieve consensus in scalable, heterogeneous clusters. The round-robin protocol improves on basic randomized gossiping by distributing gossip messages in a deterministic order that optimizes bandwidth consumption. The binary round-robin protocol eliminates redundant gossiping completely, and the round-robin with sequence check protocol is a useful extension that yields efficient detection times without the need for system-specific optimization. The distributed consensus algorithm works with these gossip protocols so that the operable nodes in the cluster reach agreement on the state of the system, whether the system has a flat or a layered design. The protocols are simulated and evaluated in terms of consensus time and scalability using a high-fidelity, fault-injection model for distributed systems composed of clusters of workstations connected by high-performance networks.
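To make the idea of a deterministic gossip order concrete, the following is a minimal sketch under stated assumptions, not the paper's implementation. The abstract does not give the destination rules, so the two schedules here are assumed for illustration: a plain rotation in which the offset to the destination grows by one each round, and a doubling rotation in which the offset is a power of two. The function names, the synchronous-round model, and the failure-free setting are all hypothetical simplifications.

```python
from typing import Callable, List, Set


def round_robin_target(node: int, rnd: int, n: int) -> int:
    """Assumed plain rotation: in round rnd (1-based), the offset to the
    destination cycles through 1, 2, ..., n - 1."""
    offset = 1 + (rnd - 1) % (n - 1)
    return (node + offset) % n


def binary_round_robin_target(node: int, rnd: int, n: int) -> int:
    """Assumed doubling rotation: the offset cycles through 1, 2, 4, ...
    over ceil(log2(n)) rounds."""
    phases = (n - 1).bit_length()  # equals ceil(log2(n)) for n >= 2
    offset = 1 << ((rnd - 1) % phases)
    return (node + offset) % n


def rounds_to_full_knowledge(
    n: int,
    target_fn: Callable[[int, int, int], int],
    max_rounds: int = 1000,
) -> int:
    """Count synchronous, failure-free rounds until every node has heard
    (directly or indirectly) from every other node. Requires n >= 2."""
    # known[i] is the set of nodes whose state node i has received so far.
    known: List[Set[int]] = [{i} for i in range(n)]
    for rnd in range(1, max_rounds + 1):
        # Every node sends what it knew at the start of the round.
        snapshot = [set(k) for k in known]
        for sender in range(n):
            receiver = target_fn(sender, rnd, n)
            known[receiver] |= snapshot[sender]
        if all(len(k) == n for k in known):
            return rnd
    raise RuntimeError("full knowledge not reached within max_rounds")


if __name__ == "__main__":
    for name, fn in [("round-robin", round_robin_target),
                     ("binary round-robin", binary_round_robin_target)]:
        print(name, rounds_to_full_knowledge(8, fn))
```

In this toy model, a doubling offset lets each message carry information that the receiver has not yet seen, which is one way to read the claim that redundant gossiping is eliminated. The sketch says nothing about detection times under failures or about the consensus step, both of which the paper evaluates by simulation.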