Practical Byzantine fault tolerance
OSDI '99 Proceedings of the third symposium on Operating systems design and implementation
The Byzantine Generals Problem
ACM Transactions on Programming Languages and Systems (TOPLAS)
Byzantine disk paxos: optimal resilience with byzantine shared memory
Proceedings of the twenty-third annual ACM symposium on Principles of distributed computing
Autopilot: automatic data center management
ACM SIGOPS Operating Systems Review - Systems work at Microsoft Research
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
PeerReview: practical accountability for distributed systems
Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
An analysis of data corruption in the storage stack
FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
Pig latin: a not-so-foreign language for data processing
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Distributed computing in SOSP and OSDI
ACM SIGACT News
DRAM errors in the wild: a large-scale field study
Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
AN-Encoding Compiler: Building Safety-Critical Systems with Commodity Hardware
SAFECOMP '09 Proceedings of the 28th International Conference on Computer Safety, Reliability, and Security
Building a high-level dataflow system on top of Map-Reduce: the Pig experience
Proceedings of the VLDB Endowment
HotDep'08 Proceedings of the Fourth conference on Hot topics in system dependability
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Hi-index | 0.00 |
Neither of the two broad classes of fault models considered by traditional fault tolerance techniques --- crash and Byzantine faults --- suit the environment of systems that run in today's data centers. On the one hand, assuming Byzantine faults is considered overkill due to the assumption of a worst-case adversarial behavior, and the use of other techniques to guard against malicious attacks. On the other hand, the crash fault model is insufficient since it does not capture non-crash faults that may result from a variety of unexpected conditions that are commonplace in this setting. In this paper, we present the case for a more practical approach at handling non-crash (but non-adversarial) faults in data-center scale computations. In this context, we discuss how such problem can be tackled for an important class of data-center scale systems: systems for large-scale processing of data, with a particular focus on the Pig programming framework. Such an approach not only covers a significant fraction of the processing jobs that run in today's data centers, but is potentially applicable to a broader class of applications.