Improving availability in distributed systems with failure informers

  • Authors:
  • Joshua B. Leners;Trinabh Gupta;Marcos K. Aguilera;Michael Walfish

  • Affiliations:
  • The University of Texas at Austin;The University of Texas at Austin;Microsoft Research Silicon Valley;The University of Texas at Austin

  • Venue:
  • NSDI'13: Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation
  • Year:
  • 2013

Abstract

This paper addresses a core question in distributed systems: how should applications be notified of failures? When a distributed system acts on failure reports, the system's correctness and availability depend on the granularity and semantics of those reports. The system's availability also depends on coverage (failures are reported), accuracy (reports are justified), and timeliness (reports come quickly). This paper describes Pigeon, a failure reporting service designed to enable high availability in the applications that use it. Pigeon exposes a new abstraction, called a failure informer, which allows applications to take informed, application-specific recovery actions, and which encapsulates uncertainty, allowing applications to proceed safely in the presence of doubt. Pigeon also significantly improves over the previous state of the art in the three-way trade-off among coverage, accuracy, and timeliness.