Tiered fault tolerance for long-term integrity

  • Authors:
  • Byung-Gon Chun;Petros Maniatis;Scott Shenker;John Kubiatowicz

  • Affiliations:
  • Intel Research Berkeley;Intel Research Berkeley;University of California at Berkeley;University of California at Berkeley

  • Venue:
  • FAST '09: Proceedings of the 7th Conference on File and Storage Technologies
  • Year:
  • 2009

Quantified Score

Hi-index 0.02

Abstract

Fault-tolerant services typically make assumptions about the type and maximum number of faults that they can tolerate while providing their correctness guarantees; when such a fault threshold is violated, correctness is lost. We revisit the notion of fault thresholds in the context of long-term archival storage. We observe that fault thresholds are inevitably violated in long-term services, making traditional fault tolerance inapplicable to the long term. In this work, we undertake a "reallocation of the fault-tolerance budget" of a long-term service. We split the service into service pieces, each of which can tolerate a different number of faults without failing (and without causing the whole service to fail): each piece belongs to a critical trusted fault tier, which must never fail, to an untrusted fault tier, which can fail massively and often, or to one of the fault tiers in between. By carefully engineering the split of a long-term service into pieces that must obey distinct fault thresholds, we can prolong the service's life and postpone its inevitable demise. We demonstrate this approach with Bonafide, a long-term key-value store that, unlike all similar systems proposed in the literature, maintains integrity in the face of Byzantine faults without requiring self-certified data. We describe the notion of tiered fault tolerance and the design, implementation, and experimental evaluation of Bonafide, and we argue that our approach is a practical yet significant improvement over the state of the art for long-term services.
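
The idea of per-tier fault thresholds can be illustrated with a minimal sketch. This is not Bonafide's implementation: the names `Tier`, `service_correct`, and `TIERS`, and the intermediate threshold of 1, are assumptions made purely for illustration. The sketch only captures the rule stated in the abstract, that the service as a whole stays correct as long as each piece stays within its own tier's threshold.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Tier:
    """A hypothetical fault tier with its own fault threshold."""
    name: str
    # Maximum number of faults the tier tolerates.
    # None models an untrusted tier that may fail without bound.
    max_faults: Optional[int]

    def holds(self, observed_faults: int) -> bool:
        # The tier's guarantee holds while faults stay within its threshold.
        return self.max_faults is None or observed_faults <= self.max_faults


def service_correct(tiers: List[Tier], observed: Dict[str, int]) -> bool:
    """The service is correct only if every tier respects its own threshold."""
    return all(t.holds(observed.get(t.name, 0)) for t in tiers)


# Illustrative tiers; the threshold values are assumptions, not from the paper.
TIERS = [
    Tier("trusted", 0),        # must never fail
    Tier("intermediate", 1),   # tolerates a bounded number of faults
    Tier("untrusted", None),   # can fail massively and often
]
```

For example, `service_correct(TIERS, {"untrusted": 100})` holds, while a single fault in the trusted tier, `service_correct(TIERS, {"trusted": 1})`, does not.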