Reliable data-center scale computations

  • Authors:
  • Pramod Bhatotia; Alexander Wieder; Rodrigo Rodrigues; Flavio Junqueira; Benjamin Reed

  • Affiliations:
  • Max Planck Institute for Software Systems (MPI-SWS); Max Planck Institute for Software Systems (MPI-SWS); Max Planck Institute for Software Systems (MPI-SWS); Yahoo! Research; Yahoo! Research

  • Venue:
  • Proceedings of the 4th International Workshop on Large Scale Distributed Systems and Middleware
  • Year:
  • 2010

Abstract

Neither of the two broad classes of fault models considered by traditional fault tolerance techniques, crash faults and Byzantine faults, suits the environment of systems that run in today's data centers. On the one hand, assuming Byzantine faults is considered overkill, because the model assumes worst-case adversarial behavior and other techniques are already used to guard against malicious attacks. On the other hand, the crash fault model is insufficient, since it does not capture non-crash faults that may result from a variety of unexpected conditions that are commonplace in this setting. In this paper, we present the case for a more practical approach to handling non-crash (but non-adversarial) faults in data-center scale computations. In this context, we discuss how this problem can be tackled for an important class of data-center scale systems: systems for large-scale processing of data, with a particular focus on the Pig programming framework. Such an approach not only covers a significant fraction of the processing jobs that run in today's data centers, but is also potentially applicable to a broader class of applications.