World wide failures

Authors:
Werner Vogels
Affiliations:
Cornell University, Ithaca, NY
Venue:
EW 7 Proceedings of the 7th workshop on ACM SIGOPS European workshop: Systems support for worldwide applications
Year:
1996

Abstract

The one issue that unites almost all approaches to distributed computing is the need to know whether certain components in the system have failed or are otherwise unavailable. When designing and building systems that need to function at a global scale, failure management needs to be considered a fundamental building block. This paper describes the development of a system-independent failure management service, which allows systems and applications to incorporate accurate detection of failed processes, nodes and networks, without the need for making compromises in their particular design.