Reliable data-center scale computations

Authors:
Pramod Bhatotia;Alexander Wieder;Rodrigo Rodrigues;Flavio Junqueira;Benjamin Reed
Affiliations:
Max Planck Institute for Software Systems (MPI-SWS);Max Planck Institute for Software Systems (MPI-SWS);Max Planck Institute for Software Systems (MPI-SWS);Yahoo! Research;Yahoo! Research
Venue:
Proceedings of the 4th International Workshop on Large Scale Distributed Systems and Middleware
Year:
2010

Abstract

Neither of the two broad classes of fault models considered by traditional fault tolerance techniques, crash faults and Byzantine faults, suits the environment of systems that run in today's data centers. On the one hand, assuming Byzantine faults is considered overkill, because it presumes worst-case adversarial behavior and because other techniques are already used to guard against malicious attacks. On the other hand, the crash fault model is insufficient, since it does not capture the non-crash faults that may result from a variety of unexpected conditions that are commonplace in this setting. In this paper, we present the case for a more practical approach to handling non-crash (but non-adversarial) faults in data-center scale computations. In this context, we discuss how this problem can be tackled for an important class of data-center scale systems: systems for large-scale processing of data, with a particular focus on the Pig programming framework. Such an approach not only covers a significant fraction of the processing jobs that run in today's data centers, but is also potentially applicable to a broader class of applications.