Detecting failures in distributed systems with the Falcon spy network

Authors:
Joshua B. Leners;Hao Wu;Wei-Lun Hung;Marcos K. Aguilera;Michael Walfish
Affiliations:
The University of Texas at Austin;The University of Texas at Austin;The University of Texas at Austin;Microsoft Research Silicon Valley;The University of Texas at Austin
Venue:
SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Year:
2011

Abstract

A common way for a distributed system to tolerate crashes is to explicitly detect them and then recover from them. Interestingly, detection can take much longer than recovery, as a result of many advances in recovery techniques, making failure detection the dominant factor in these systems' unavailability when a crash occurs. This paper presents the design, implementation, and evaluation of Falcon, a failure detector with several features. First, Falcon's common-case detection time is sub-second, which keeps unavailability low. Second, Falcon is reliable: it never reports a process as down when it is actually up. Third, Falcon sometimes kills to achieve reliable detection but aims to kill the smallest needed component. Falcon achieves these features by coordinating a network of spies, each monitoring a layer of the system. Falcon's main cost is a small amount of platform-specific logic. Falcon is thus the first failure detector that is fast, reliable, and viable. As such, it could change the way that a class of distributed systems is built.
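The layered idea in the abstract can be illustrated with a minimal sketch. This is not Falcon's actual interface or algorithm: the names `Spy`, `Status`, and `detect`, the layer labels, and the escalation rule are all hypothetical, chosen only to show how a detector might report DOWN solely after a crash is confirmed or after it has killed the smallest component it can reach.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Tuple


class Status(Enum):
    UP = "up"
    DOWN = "down"
    UNKNOWN = "unknown"  # the spy did not answer in time


@dataclass
class Spy:
    """Hypothetical spy for one layer (not Falcon's real API)."""
    layer: str                   # illustrative label, e.g. "app", "os", "vmm"
    probe: Callable[[], Status]  # asks this layer whether it is alive
    kill: Callable[[], None]     # forcibly stops this layer


def detect(spies: List[Spy]) -> Tuple[Status, Optional[str]]:
    """Spies are ordered innermost (smallest component) first.

    Returns (status, killed_layer). DOWN is returned only if a spy confirmed
    a crash or a layer was killed, so a live target is never reported DOWN.
    """
    first = spies[0].probe()
    if first is not Status.UNKNOWN:
        return first, None
    # The innermost spy is silent: escalate outward, one layer at a time.
    for i in range(1, len(spies)):
        status = spies[i].probe()
        if status is Status.DOWN:
            # An enclosing layer is down, so everything inside it is down.
            return Status.DOWN, None
        if status is Status.UP:
            # First responsive layer: kill only the layer just inside it.
            spies[i - 1].kill()
            return Status.DOWN, spies[i - 1].layer
    # No spy answered: make no claim rather than risk a false report.
    return Status.UNKNOWN, None


# Example: the app and OS spies are silent, the VMM spy answers.
killed: List[str] = []
spies = [
    Spy("app", lambda: Status.UNKNOWN, lambda: killed.append("app")),
    Spy("os", lambda: Status.UNKNOWN, lambda: killed.append("os")),
    Spy("vmm", lambda: Status.UP, lambda: killed.append("vmm")),
]
result, killed_layer = detect(spies)
```

In this example the sketch kills the "os" layer, the smallest component that a responsive spy can act on, and only then reports the target as down. When no spy answers at all, it returns UNKNOWN instead of guessing, which is one simple way to preserve the property that an up process is never reported as down.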