RAFT at work: speeding-up mapreduce applications under task and node failures

Authors:
Jorge-Arnulfo Quiané-Ruiz;Christoph Pinkel;Jörg Schad;Jens Dittrich
Affiliations:
Saarland University, Saarbrücken, Germany;Saarland University, Saarbrücken, Germany;Saarland University, Saarbrücken, Germany;Saarland University, Saarbrücken, Germany
Venue:
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Year:
2011

Citing 12
Cited 2

A survey of rollback-recovery protocols in message-passing systems

ACM Computing Surveys (CSUR)
Checkpointing Memory-Resident Databases

Proceedings of the Fifth International Conference on Data Engineering
SIREN: A Memory-Conserving, Snapshot-Consistent Checkpoint Algorithm for in-Memory Databases

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
A large-scale study of failures in high-performance computing systems

DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Fault-tolerance in the borealis distributed stream processing system

ACM Transactions on Database Systems (TODS)
Quincy: fair scheduling for distributed computing clusters

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Hadoop: The Definitive Guide

Hadoop: The Definitive Guide
Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling

Proceedings of the 5th European conference on Computer systems
Runtime measurements in the cloud: observing, analyzing, and reducing variance

Proceedings of the VLDB Endowment
Hadoop++: making a yellow elephant run like a cheetah (without it even noticing)

Proceedings of the VLDB Endowment
RAFTing MapReduce: Fast recovery on the RAFT

ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering

The family of mapreduce and large-scale data processing systems

ACM Computing Surveys (CSUR)
MapReduce "garbage" collection

CASCON '13 Proceedings of the 2013 Conference of the Center for Advanced Studies on Collaborative Research

Quantified Score

Hi-index	0.00

Visualization

Abstract

The MapReduce framework is typically deployed on very large computing clusters where task and node failures are no longer an exception but the rule. Thus, fault-tolerance is an important aspect for the efficient operation of MapReduce jobs. However, currently MapReduce implementations fully recompute failed tasks (subparts of a job) from the beginning. This can significantly decrease the runtime performance of MapReduce applications. We present an alternative system that implements RAFT ideas. RAFT is a family of powerful and inexpensive Recovery Algorithms for Fast-Tracking MapReduce jobs under task and node failures. To recover from task failures, RAFT exploits the intermediate results persisted by MapReduce at several points in time. RAFT piggybacks checkpoints on the task progress computation. To recover from node failures, RAFT maintains a per-map task list of all input key-value pairs producing intermediate results and pushes intermediate results to reducers. In this demo, we demonstrate that RAFT recovers efficiently from both task and node failures. Further, the audience can compare RAFT with Hadoop via an easy-to-use web interface.