Several distributed peer-to-peer applications require weakly consistent knowledge of process group membership information at all participating processes. SWIM is a generic software module that offers this service for large-scale process groups. The SWIM effort is motivated by the unscalability of traditional heartbeating protocols, which either impose network loads that grow quadratically with group size, or compromise response times or false positive frequency in detecting process crashes. This paper reports on the design, implementation and performance of the SWIM sub-system on a large cluster of commodity PCs.

Unlike traditional heartbeating protocols, SWIM separates the failure detection and membership update dissemination functionalities of the membership protocol. Processes are monitored through an efficient peer-to-peer periodic randomized probing protocol. Both the expected time to first detection of each process failure and the expected message load per member are independent of group size. Information about membership changes, such as process joins, drop-outs and failures, is propagated by piggybacking on ping messages and acknowledgments. This results in a robust and fast infection-style (also called epidemic or gossip-style) dissemination.

The rate of false failure detections in the SWIM system is reduced by modifying the protocol to allow group members to suspect a process before declaring it as failed; this allows the system to discover and rectify false failure detections. Finally, the protocol guarantees a deterministic time bound to detect failures.

Experimental results from the SWIM prototype are presented. We discuss the extensibility of the design to a WAN-wide scale.
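The three mechanisms named in the abstract (periodic randomized probing, piggybacked dissemination, and suspicion before failure) can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the SWIM implementation. The names `Member`, `run_period` and `simulate`, the suspicion timeout of three periods, and the severity ranking used to merge conflicting updates are all hypothetical. The sketch also assumes a lossless network, so a live process is never suspected here.

```python
import random

ALIVE, SUSPECT, FAILED = "alive", "suspect", "failed"
SEVERITY = {ALIVE: 0, SUSPECT: 1, FAILED: 2}  # assumed merge rule: worse status wins


class Member:
    def __init__(self, name, peers):
        self.name = name
        self.up = True  # ground truth: is this process running?
        self.view = {p: ALIVE for p in peers if p != name}  # local membership view
        self.suspected_at = {}  # peer -> period in which local suspicion began

    def gossip(self):
        """Updates piggybacked on a ping or ack: every non-alive entry."""
        return {p: s for p, s in self.view.items() if s != ALIVE}

    def merge(self, updates, period):
        """Apply piggybacked updates, ignoring rumours about ourselves."""
        for peer, status in updates.items():
            if peer == self.name:
                continue
            if SEVERITY[status] > SEVERITY[self.view[peer]]:
                self.view[peer] = status
                if status == SUSPECT:
                    self.suspected_at[peer] = period
                else:
                    self.suspected_at.pop(peer, None)

    def expire_suspicions(self, period, timeout):
        """Declare a peer failed once it has stayed suspected for `timeout` periods."""
        for peer, since in list(self.suspected_at.items()):
            if self.view[peer] != SUSPECT:
                del self.suspected_at[peer]
            elif period - since >= timeout:
                self.view[peer] = FAILED
                del self.suspected_at[peer]


def run_period(members, period, rng, timeout):
    """One protocol period: every live member probes one random peer."""
    by_name = {m.name: m for m in members}
    for m in members:
        if not m.up:
            continue
        candidates = [p for p, s in m.view.items() if s != FAILED]
        if not candidates:
            continue
        target = by_name[rng.choice(candidates)]
        if target.up:
            # Ack received: clear any local suspicion, then exchange updates.
            if m.view[target.name] == SUSPECT:
                m.view[target.name] = ALIVE
                m.suspected_at.pop(target.name, None)
            target.merge(m.gossip(), period)  # piggybacked on the ping
            m.merge(target.gossip(), period)  # piggybacked on the ack
        elif m.view[target.name] == ALIVE:
            # No ack: suspect first instead of declaring failure immediately.
            m.view[target.name] = SUSPECT
            m.suspected_at[target.name] = period
    for m in members:
        if m.up:
            m.expire_suspicions(period, timeout)


def simulate(n=8, crashed="m0", seed=1, timeout=3, max_periods=200):
    """Crash one member and run until every live member marks it failed."""
    names = [f"m{i}" for i in range(n)]
    members = [Member(name, names) for name in names]
    rng = random.Random(seed)
    for m in members:
        if m.name == crashed:
            m.up = False
    live = [m for m in members if m.up]
    for period in range(max_periods):
        run_period(members, period, rng, timeout)
        if all(m.view[crashed] == FAILED for m in live):
            return period, members
    return None, members


if __name__ == "__main__":
    period, _ = simulate()
    print("crash known to all live members after period", period)
```

Each member sends one probe per period regardless of group size, which is the property behind the constant per-member message load described above. The severity ranking is a simplification: with message loss, a protocol also needs a way for a wrongly suspected process to refute the suspicion, which this sketch omits.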