The one issue that unites almost all approaches to distributed computing is the need to know whether certain components in the system have failed or are otherwise unavailable. When designing and building systems that must function at a global scale, failure management needs to be treated as a fundamental building block. This paper describes the development of a system-independent failure management service, which allows systems and applications to incorporate accurate detection of failed processes, nodes, and networks without compromising their own design.