FUSE is a lightweight failure notification service for building distributed systems. Distributed systems built with FUSE are guaranteed that failure notifications never fail. Whenever a failure notification is triggered, all live members of the FUSE group will hear a notification within a bounded period of time, irrespective of node or communication failures. In contrast to previous work on failure detection, the responsibility for deciding that a failure has occurred is shared between the FUSE service and the distributed application. This allows applications to implement their own definitions of failure. Our experience building a scalable distributed event delivery system on an overlay network has convinced us of the usefulness of this service. Our results demonstrate that the network costs of each FUSE group can be small; in particular, our overlay network implementation requires no additional liveness-verifying ping traffic beyond that already needed to maintain the overlay, making the steady-state network load independent of the number of active FUSE groups.
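The group semantics described in the abstract can be illustrated with a toy, single-process model. This is only a sketch under stated assumptions, not the FUSE implementation: the class `FuseGroup` and the methods `register_handler`, `signal_failure`, and `crash` are hypothetical names chosen for illustration, there is no networking, and the bounded-time delivery guarantee is not modeled. It shows two points from the text: either the application or the service may declare a failure, and once a failure is declared every live member's handler is invoked, exactly once per group.

```python
from typing import Callable, Dict, List, Set

# Hypothetical handler type: receives a human-readable failure reason.
Handler = Callable[[str], None]


class FuseGroup:
    """Toy in-process model of one failure-notification group (illustrative only)."""

    def __init__(self, members: Set[str]) -> None:
        self.members = set(members)
        self.crashed: Set[str] = set()
        self.handlers: Dict[str, List[Handler]] = {m: [] for m in self.members}
        self.failed = False

    def register_handler(self, member: str, handler: Handler) -> None:
        """Attach a callback that runs on `member` when the group fails."""
        if member not in self.members:
            raise ValueError(f"{member!r} is not in this group")
        self.handlers[member].append(handler)

    def signal_failure(self, reason: str) -> None:
        """Application-side trigger: the application decides what counts as failure."""
        if self.failed:
            return  # a group fails at most once; later signals are no-ops
        self.failed = True
        # Every live member hears the notification; crashed members cannot.
        for member in sorted(self.members - self.crashed):
            for handler in self.handlers[member]:
                handler(reason)

    def crash(self, member: str) -> None:
        """Service-side trigger: a detected member crash fails the whole group."""
        self.crashed.add(member)
        self.signal_failure(f"node {member} unreachable")


# Usage: the application declares a failure using its own definition.
heard: List[tuple] = []
group = FuseGroup({"a", "b", "c"})
for name in ("a", "b", "c"):
    group.register_handler(name, lambda reason, name=name: heard.append((name, reason)))

group.signal_failure("application-level timeout")
group.signal_failure("duplicate signal")  # ignored: the group has already failed
```

In this model a crash of any member and an explicit application signal lead to the same outcome for the survivors, which mirrors the shared responsibility between service and application described above.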