A common way for a distributed system to tolerate crashes is to explicitly detect them and then recover from them. Interestingly, detection can take much longer than recovery, as a result of many advances in recovery techniques, making failure detection the dominant factor in these systems' unavailability when a crash occurs. This paper presents the design, implementation, and evaluation of Falcon, a failure detector with several features. First, Falcon's common-case detection time is sub-second, which keeps unavailability low. Second, Falcon is reliable: it never reports a process as down when it is actually up. Third, Falcon sometimes kills to achieve reliable detection but aims to kill the smallest needed component. Falcon achieves these features by coordinating a network of spies, each monitoring a layer of the system. Falcon's main cost is a small amount of platform-specific logic. Falcon is thus the first failure detector that is fast, reliable, and viable. As such, it could change the way that a class of distributed systems is built.
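The two properties the abstract stresses, never reporting a live process as down and killing only the smallest needed component, can be illustrated with a minimal sketch. This is not Falcon's actual protocol: the names `Spy`, `Report`, `detect`, the three-valued `Status`, and the `probe`/`kill` callbacks are all hypothetical, and the sketch assumes spies are listed from the smallest component (for example, the application) to the largest (for example, the host).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class Status(Enum):
    UP = "up"
    DOWN = "down"
    UNKNOWN = "unknown"  # the spy could not decide, e.g. its probe timed out


@dataclass
class Spy:
    """Hypothetical monitor for one layer; not Falcon's real interface."""
    name: str
    probe: Callable[[], Status]  # local check of this layer
    kill: Callable[[], bool]     # tries to stop this layer; True if it succeeded


@dataclass
class Report:
    status: Status
    killed: Optional[str] = None  # name of the layer that was killed, if any


def detect(spies: List[Spy]) -> Report:
    """Report on the target, which is the first (smallest) layer in `spies`.

    DOWN is returned only when the target is confirmed down or a layer
    containing it has been killed, so a live target is never reported down.
    """
    if not spies:
        raise ValueError("at least one spy is required")

    status = spies[0].probe()
    if status is not Status.UNKNOWN:
        return Report(status)

    # The target's state is uncertain. Make "down" true by killing, starting
    # with the smallest component and escalating only if that kill fails.
    for spy in spies:
        if spy.kill():
            return Report(Status.DOWN, killed=spy.name)

    # No layer could be killed, so the detector declines to claim DOWN.
    return Report(Status.UNKNOWN)
```

For instance, if the application spy's probe returns `Status.UNKNOWN` and its kill fails, while the next layer's kill succeeds, `detect` reports `DOWN` and names that second layer as the one killed. If no kill succeeds, the sketch returns `UNKNOWN` instead of guessing, which is the simplified analogue of the reliability property described above.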