Improving availability in distributed systems with failure informers

Authors:
Joshua B. Leners;Trinabh Gupta;Marcos K. Aguilera;Michael Walfish
Affiliations:
The University of Texas at Austin;The University of Texas at Austin;Microsoft Research Silicon Valley;The University of Texas at Austin
Venue:
nsdi'13 Proceedings of the 10th USENIX conference on Networked Systems Design and Implementation
Year:
2013

Abstract

This paper addresses a core question in distributed systems: how should applications be notified of failures? When a distributed system acts on failure reports, the system's correctness and availability depend on the granularity and semantics of those reports. The system's availability also depends on coverage (failures are reported), accuracy (reports are justified), and timeliness (reports come quickly). This paper describes Pigeon, a failure reporting service designed to enable high availability in the applications that use it. Pigeon exposes a new abstraction, called a failure informer, which allows applications to take informed, application-specific recovery actions, and which encapsulates uncertainty, allowing applications to proceed safely in the presence of doubt. Pigeon also significantly improves over the previous state of the art in the three-way trade-off among coverage, accuracy, and timeliness.
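
As a rough illustration of how an application might use reports that carry their own uncertainty, the sketch below picks a recovery action from a single report. It is not Pigeon's interface: `FailureReport`, its `certain` field, `Action`, and `choose_recovery` are hypothetical names introduced here, and the recovery policy is an assumption made only for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """Hypothetical recovery actions an application might define."""
    FAIL_OVER_NOW = "fail over immediately"
    WAIT_THEN_FAIL_OVER = "wait out a backup timeout, then fail over"


@dataclass(frozen=True)
class FailureReport:
    """Hypothetical report shape; not taken from the paper."""
    target: str    # process or host the report is about
    certain: bool  # True if the reporter is sure the failure occurred


def choose_recovery(report: FailureReport) -> Action:
    """Pick an application-specific recovery action from one report."""
    if report.certain:
        # No doubt about the failure: recovering at once is safe.
        return Action.FAIL_OVER_NOW
    # Doubt remains: take the slower path instead of acting on a
    # report that might turn out to be unjustified.
    return Action.WAIT_THEN_FAIL_OVER


if __name__ == "__main__":
    for report in (FailureReport("replica-1", certain=True),
                   FailureReport("replica-2", certain=False)):
        print(report.target, "->", choose_recovery(report).value)
```

The point of the sketch is only that the decision lives in the application: the report states what is known and how confidently, and the application maps that to whatever action suits it.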