SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol

Authors:
Abhinandan Das; Indranil Gupta; Ashish Motivala
Venue:
DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
Year:
2002

Cited by 19:

On the design of distributed protocols from differential equations
Proceedings of the twenty-third annual ACM symposium on Principles of distributed computing

Efficient and Adaptive Epidemic-Style Protocols for Reliable and Scalable Multicast
IEEE Transactions on Parallel and Distributed Systems

Fireflies: scalable support for intrusion-tolerant network overlays
Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006

BitVault: a highly reliable distributed data retention platform
ACM SIGOPS Operating Systems Review - Systems work at Microsoft Research

FUSE: lightweight guaranteed distributed failure notification
OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6

A Scalable and Efficient Self-Organizing Failure Detector for Grid Applications
GRID '05 Proceedings of the 6th IEEE/ACM International Workshop on Grid Computing

Active and passive techniques for group size estimation in large-scale and dynamic distributed systems
Journal of Systems and Software

Reliable on-demand management operations for large-scale distributed applications
ACM SIGOPS Operating Systems Review - Gossip-based computer networking

COHESION - A microkernel based Desktop Grid platform for irregular task-parallel applications
Future Generation Computer Systems

Shortest-path routing in randomized DHT-based Peer-to-Peer systems
Computer Networks: The International Journal of Computer and Telecommunications Networking

A group membership service for large-scale grids
Proceedings of the 6th international workshop on Middleware for grid computing

Grouping algorithms for scalable self-monitoring distributed systems
Autonomics '08 Proceedings of the 2nd International Conference on Autonomic Computing and Communication Systems

Cassandra: a decentralized structured storage system
ACM SIGOPS Operating Systems Review

SFDHT: a DHT designed for server farm
GLOBECOM'09 Proceedings of the 28th IEEE conference on Global telecommunications

Autonomous and scalable failure detection in distributed systems
International Journal of Autonomous and Adaptive Communications Systems

Modeling Billion-Node Torus Networks Using Massively Parallel Discrete-Event Simulation
PADS '11 Proceedings of the 2011 IEEE Workshop on Principles of Advanced and Distributed Simulation

Experimental evaluation of a failure detection service based on a gossip strategy
ICA3PP'11 Proceedings of the 11th international conference on Algorithms and architectures for parallel processing - Volume Part II

TransMAN: a group communication system for MANETs
ICDCN'06 Proceedings of the 8th international conference on Distributed Computing and Networking

A case for design methodology research in self-* distributed systems
Self-star Properties in Complex Information Systems

Abstract

Several distributed peer-to-peer applications require weakly-consistent knowledge of process group membership information at all participating processes. SWIM is a generic software module that offers this service for large-scale process groups. The SWIM effort is motivated by the poor scalability of traditional heartbeating protocols, which either impose network loads that grow quadratically with group size, or compromise response times or false positive frequency when detecting process crashes. This paper reports on the design, implementation and performance of the SWIM sub-system on a large cluster of commodity PCs.

Unlike traditional heartbeating protocols, SWIM separates the failure detection and membership update dissemination functionalities of the membership protocol. Processes are monitored through an efficient peer-to-peer periodic randomized probing protocol. Neither the expected time to first detection of a process failure nor the expected message load per member varies with group size. Information about membership changes, such as process joins, drop-outs and failures, is propagated by piggybacking it on ping messages and acknowledgments. This results in a robust and fast infection-style (also called epidemic or gossip-style) dissemination.

The rate of false failure detections in the SWIM system is reduced by modifying the protocol to allow group members to suspect a process before declaring it as failed; this allows the system to discover and rectify false failure detections. Finally, the protocol guarantees a deterministic time bound to detect failures.

Experimental results from the SWIM prototype are presented. We discuss the extensibility of the design to a WAN-wide scale.
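
The claim that expected detection time does not vary with group size can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the paper. It assumes a simplified model in which, in every protocol period, each live member probes one target chosen uniformly at random from the other members, and no messages are lost. The function names are hypothetical.

```python
import math


def detection_probability(n: int) -> float:
    """Chance that a crashed member is probed in one protocol period.

    Simplified model (an assumption, not the paper's full analysis):
    a group of n members, one of which has crashed; each of the n - 1
    survivors picks one probe target uniformly at random among the
    other n - 1 members; no messages are lost.
    """
    if n < 2:
        raise ValueError("need at least two members")
    survivors = n - 1
    # Each survivor misses the crashed member with probability 1 - 1/(n - 1).
    miss_all = (1.0 - 1.0 / (n - 1)) ** survivors
    return 1.0 - miss_all


def expected_periods_to_detection(n: int) -> float:
    """Mean of the geometric distribution with the per-period success rate."""
    return 1.0 / detection_probability(n)


if __name__ == "__main__":
    for n in (10, 100, 1_000, 100_000):
        p = detection_probability(n)
        print(f"n={n:>7}  p={p:.4f}  periods={expected_periods_to_detection(n):.3f}")
    # Limit as n grows: p -> 1 - 1/e, so periods -> e / (e - 1).
    print("limit:", math.e / (math.e - 1.0))
```

Under these assumptions the per-period detection probability tends to 1 - 1/e as the group grows, so the expected number of periods until some member first probes the crashed process settles near e/(e - 1), roughly 1.58, regardless of group size. Each member also sends a single probe per period in this model, which is why the per-member message load stays constant.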