Low-Overhead Fault Tolerance for High-Throughput Data Processing Systems

Authors:
Andre Martin;Thomas Knauth;Stephan Creutz;Diogo Becker;Stefan Weigert;Christof Fetzer;Andrey Brito
Venue:
ICDCS '11 Proceedings of the 2011 31st International Conference on Distributed Computing Systems
Year:
2011

Citing 0
Cited 7

Elastic complex event processing
Proceedings of the 8th Middleware Doctoral Symposium

Community-based analysis of netflow for early detection of security incidents
LISA'11 Proceedings of the 25th international conference on Large Installation System Administration

QoS monitoring in a cloud services environment: the SRT-15 approach
Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing

Fault-tolerant complex event processing using customizable state machine-based operators
Proceedings of the 15th International Conference on Extending Database Technology

Adaptive online scheduling in storm
Proceedings of the 7th ACM international conference on Distributed event-based systems

Tutorial: Elastic and Fault Tolerant Event Stream Processing using StreamMine3G
UCC '13 Proceedings of the 2013 IEEE/ACM 6th International Conference on Utility and Cloud Computing

Scalable and Real-Time Deep Packet Inspection
UCC '13 Proceedings of the 2013 IEEE/ACM 6th International Conference on Utility and Cloud Computing

Abstract

The MapReduce programming paradigm has proved to be a useful approach for building highly scalable data processing systems. One important reason for its success is its simplicity, which extends to its fault tolerance mechanisms. However, this simplicity comes at a price: efficiency. MapReduce's fault tolerance scheme stores too much intermediate information on disk, which negatively affects job completion time. In particular, this inefficiency rules out the use of MapReduce in near real-time scenarios where jobs need to produce results quickly. In this paper, we discuss an alternative fault tolerance scheme that is inspired by virtual synchrony. The key feature of our approach is low-overhead deterministic execution, which reduces the amount of persistently stored information. In addition, because persisting intermediate results is no longer required for fault tolerance, we use more efficient communication techniques that considerably improve job completion time and throughput. Our contribution is twofold: (i) we enable the use of MapReduce for jobs ranging from seconds to a few tens of seconds, satisfying these deadlines even in the case of failures, and (ii) we considerably reduce the fault tolerance overhead and, as such, the overhead of MapReduce in general. Our modifications are transparent to the application.
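The claim that deterministic execution reduces what must be persisted can be illustrated with a toy word count. This is only a sketch of the general idea, not the paper's implementation: it assumes a single process, an in-memory list standing in for durable storage, and hypothetical names (`map_words`, `run_map_stage`, `reduce_counts`, `input_log`). Because the mapper's output depends only on its input and the replay order is fixed, the intermediate pairs never need to be stored; they can be recomputed from the logged inputs after a failure.

```python
from collections import defaultdict


def map_words(record):
    # Deterministic mapper (hypothetical): output depends only on the record.
    return [(word, 1) for word in record.split()]


def run_map_stage(input_log):
    # Recompute the intermediate pairs by replaying the logged inputs in order.
    pairs = []
    for record in input_log:
        pairs.extend(map_words(record))
    return pairs


def reduce_counts(pairs):
    # Sum the values per key.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


# Assumption: only the ordered inputs are kept durable; a list stands in for that store.
input_log = ["a b a", "b c"]

first_run = reduce_counts(run_map_stage(input_log))

# Simulated failure: the intermediate pairs are gone, so the log is replayed.
after_replay = reduce_counts(run_map_stage(input_log))

print(first_run == after_replay)
```

The replay yields the same result as the original run only because every step is deterministic; a mapper that depended on wall-clock time or random numbers would break that guarantee unless those values were logged as well.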