FUSE: lightweight guaranteed distributed failure notification

Authors:
John Dunagan;Nicholas J. A. Harvey;Michael B. Jones;Dejan Kostić;Marvin Theimer;Alec Wolman
Affiliations:
Microsoft Research, Microsoft Corporation, Redmond, WA;Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA;Microsoft Research, Microsoft Corporation, Redmond, WA;Department of Computer Science, Duke University, Durham, NC;Microsoft Research, Microsoft Corporation, Redmond, WA;Microsoft Research, Microsoft Corporation, Redmond, WA
Venue:
OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Year:
2004

Citing 31
Cited 13

Using Time Instead of Timeout for Fault-Tolerant Distributed Systems.

ACM Transactions on Programming Languages and Systems (TOPLAS)
Distributed programming in Argus

Communications of the ACM
Knowledge and common knowledge in a distributed environment

Journal of the ACM (JACM)
Impossibility of distributed consensus with one faulty process

Journal of the ACM (JACM)
Unreliable failure detectors for reliable distributed systems

Journal of the ACM (JACM)
The part-time parliament

ACM Transactions on Computer Systems (TOCS)
Practical Byzantine fault tolerance

OSDI '99 Proceedings of the third symposium on Operating systems design and implementation
On scalable and efficient distributed failure detectors

Proceedings of the twentieth annual ACM symposium on Principles of distributed computing
World wide failures

EW 7 Proceedings of the 7th workshop on ACM SIGOPS European workshop: Systems support for worldwide applications
OSPF: Anatomy of an Internet Routing Protocol

OSPF: Anatomy of an Internet Routing Protocol
On the Quality of Service of Failure Detectors

IEEE Transactions on Computers
Heartbeat: A Timeout-Free Failure Detector for Quiescent Reliable Communication

WDAG '97 Proceedings of the 11th International Workshop on Distributed Algorithms
Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems

Middleware '01 Proceedings of the IFIP/ACM International Conference on Distributed Systems Platforms Heidelberg
Pinpoint: Problem Determination in Large, Dynamic Internet Services

DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol

DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
SCRIBE: The Design of a Large-Scale Event Notification Infrastructure

NGC '01 Proceedings of the Third International COST264 Workshop on Networked Group Communication
Failure Detectors as First Class Objects

DOA '99 Proceedings of the International Symposium on Distributed Objects and Applications
Experimental Study of Internet Stability and Backbone Failures

FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
Automatic Failure-Path Inference: A Generic Introspection Technique for Internet Applications

WIAPP '03 Proceedings of the Third IEEE Workshop on Internet Applications
Scalability and accuracy in a large-scale network emulator

ACM SIGOPS Operating Systems Review - OSDI '02: Proceedings of the 5th symposium on Operating systems design and implementation
A comparison of hard-state and soft-state signaling protocols

Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications
Herald: Achieving a Global Event Notification Service

HOTOS '01 Proceedings of the Eighth Workshop on Hot Topics in Operating Systems
Unveiling the transport

ACM SIGCOMM Computer Communication Review
Adding High Availability and Autonomic Behavior to Web Services

Proceedings of the 26th International Conference on Software Engineering
Reliable Distributed Systems: Technologies, Web Services, and Applications

Reliable Distributed Systems: Technologies, Web Services, and Applications
Using runtime paths for macroanalysis

HOTOS'03 Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9
Path-based failure and evolution management

NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
Consistent and automatic replica regeneration

NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
Total recall: system support for automated availability management

NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
SkipNet: a scalable overlay network with practical locality properties

USITS'03 Proceedings of the 4th conference on USENIX Symposium on Internet Technologies and Systems - Volume 4
A gossip-style failure detection service

Middleware '98 Proceedings of the IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing

Cited by

Virtual ring routing: network routing inspired by DHTs

Proceedings of the 2006 conference on Applications, technologies, architectures, and protocols for computer communications
Latency and bandwidth-minimizing failure detectors

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Towards highly reliable enterprise network services via inference of multi-level dependencies

Proceedings of the 2007 conference on Applications, technologies, architectures, and protocols for computer communications
Exploring event correlation for failure prediction in coalitions of clusters

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Design of the notification system for failure detectors

International Journal of High Performance Computing and Networking
CrystalBall: predicting and preventing inconsistencies in deployed distributed systems

NSDI'09 Proceedings of the 6th USENIX symposium on Networked systems design and implementation
UniFAFF: a unified framework for implementing autonomic fault management and failure detection for self-managing networks

International Journal of Network Management
Why should we integrate services, servers, and networking in a data center?

Proceedings of the 1st ACM workshop on Research on enterprise networking
Predicting and preventing inconsistencies in deployed distributed systems

ACM Transactions on Computer Systems (TOCS)
Skip ring topology in fast failure detection service

PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Throughput optimal total order broadcast for cluster environments

ACM Transactions on Computer Systems (TOCS)
Quantifying event correlations for proactive failure management in networked computing systems

Journal of Parallel and Distributed Computing
dFault: fault localization in large-scale peer-to-peer systems

Proceedings of the ACM/IFIP/USENIX 11th International Conference on Middleware

Abstract

FUSE is a lightweight failure notification service for building distributed systems. Distributed systems built with FUSE are guaranteed that failure notifications never fail. Whenever a failure notification is triggered, all live members of the FUSE group will hear a notification within a bounded period of time, irrespective of node or communication failures. In contrast to previous work on failure detection, the responsibility for deciding that a failure has occurred is shared between the FUSE service and the distributed application. This allows applications to implement their own definitions of failure. Our experience building a scalable distributed event delivery system on an overlay network has convinced us of the usefulness of this service. Our results demonstrate that the network costs of each FUSE group can be small; in particular, our overlay network implementation requires no additional liveness-verifying ping traffic beyond that already needed to maintain the overlay, making the steady state network load independent of the number of active FUSE groups.
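
The group semantics described in the abstract can be illustrated with a small in-process toy model. This is only a sketch under stated assumptions: the class name `ToyFuseGroup` and the methods `register_handler` and `signal_failure` are hypothetical names chosen for illustration, not the interface of the FUSE service itself. The model runs in a single process, so it does not capture node crashes, communication failures, or the bounded notification time; it only shows that any party may declare a failure and that every registered member then hears exactly one notification.

```python
from typing import Callable, Dict, List


class ToyFuseGroup:
    """Toy, single-process model of a failure notification group.

    Hypothetical illustration only: no networking, no timing guarantees.
    """

    def __init__(self, members: List[str]) -> None:
        self.members = list(members)
        self.handlers: Dict[str, Callable[[str], None]] = {}
        self.failed = False

    def register_handler(self, member: str, handler: Callable[[str], None]) -> None:
        if member not in self.members:
            raise ValueError(f"{member} is not in the group")
        if self.failed:
            # The group has already failed, so a late registrant is told at once.
            handler(member)
            return
        self.handlers[member] = handler

    def signal_failure(self) -> None:
        # Either the service or the application may decide that a failure occurred.
        if self.failed:
            return  # each member is notified at most once per group
        self.failed = True
        for member, handler in list(self.handlers.items()):
            handler(member)
        self.handlers.clear()


notified: List[str] = []
group = ToyFuseGroup(["a", "b", "c"])
for name in group.members:
    group.register_handler(name, notified.append)

# Application-defined failure: for example, the application decides that a
# request took too long and declares the whole group failed.
group.signal_failure()
group.signal_failure()  # a repeated signal is a no-op
```

In this sketch the application calls `signal_failure` itself, which mirrors the abstract's point that the decision about what counts as a failure is shared between the service and the application.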