Rollback-recovery without checkpoints in distributed event processing systems

  • Authors:
  • Boris Koldehofe; Ruben Mayer; Umakishore Ramachandran; Kurt Rothermel; Marco Völz

  • Affiliations:
  • University of Stuttgart, Stuttgart, Germany; University of Stuttgart, Stuttgart, Germany; Georgia Institute of Technology, Atlanta, GA, USA; University of Stuttgart, Stuttgart, Germany; University of Stuttgart, Stuttgart, Germany

  • Venue:
  • Proceedings of the 7th ACM international conference on Distributed event-based systems
  • Year:
  • 2013

Abstract

Reliability is critical to many applications built on distributed event processing systems. Stateful operators in particular make it challenging to recover efficiently from failures and to ensure consistent event streams. Even during failure-free execution, state-of-the-art methods for achieving reliability incur significant run-time overhead in computational resources, event traffic, and event detection time. This paper proposes a novel rollback-recovery method that supports recovery from multiple simultaneous operator failures while eliminating the need for persistent checkpoints. Operator state is preserved in savepoints, taken at points in time when the operator's execution depends solely on the state of its incoming event streams, which predecessor operators can reproduce. We propose an expressive event processing model for determining savepoints, together with algorithms for coordinating them in a distributed operator network. Evaluations show that the method incurs very low overhead during failure-free execution compared with other approaches.
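
The savepoint idea can be illustrated with a minimal Python sketch. It is not the paper's algorithm or event processing model. It assumes a single predecessor, a tumbling-window sum operator, and events of the form (sequence number, value). All names (Predecessor, WindowSum, feed, run_with_failure) are hypothetical. The sketch only shows why no checkpoint of operator state is needed when the operator's state at a window boundary depends solely on input that the predecessor can replay.

```python
from dataclasses import dataclass, field


@dataclass
class Predecessor:
    """Hypothetical upstream operator that can reproduce its output stream."""
    log: list = field(default_factory=list)  # retained outgoing events
    savepoint: int = 0  # successor's savepoint, kept off the failing node (assumption)

    def emit(self, event):
        self.log.append(event)
        return event

    def replay(self, from_seq):
        # Return a copy of every retained event at or after the savepoint.
        return [e for e in self.log if e[0] >= from_seq]

    def prune(self, before_seq):
        # Events before the successor's savepoint are no longer needed.
        self.log = [e for e in self.log if e[0] >= before_seq]


class WindowSum:
    """Stateful operator: sums values over tumbling windows of `size` events."""

    def __init__(self, size):
        self.size = size
        self.buffer = []    # volatile state, lost on failure
        self.savepoint = 0  # sequence number where the current window starts

    def process(self, event):
        seq, value = event
        self.buffer.append(value)
        if len(self.buffer) < self.size:
            return None
        result = sum(self.buffer)
        self.buffer = []
        # The buffer is empty here, so future output depends only on
        # future input: the next sequence number is a valid savepoint.
        self.savepoint = seq + 1
        return result


def feed(op, pred, event, outputs):
    result = op.process(event)
    if result is not None:
        outputs.append(result)
        pred.savepoint = op.savepoint  # record the savepoint upstream
        pred.prune(op.savepoint)


def run_with_failure(values, size, fail_at=None):
    pred, op, outputs = Predecessor(), WindowSum(size), []
    for seq, value in enumerate(values):
        event = pred.emit((seq, value))
        if seq == fail_at:
            op = WindowSum(size)  # crash: the volatile buffer is gone
            # Recover by replaying from the savepoint; the replay
            # includes the event that was just emitted.
            for old in pred.replay(pred.savepoint):
                feed(op, pred, old, outputs)
        else:
            feed(op, pred, event, outputs)
    return outputs
```

In this sketch, run_with_failure([1, 2, 3, 4, 5, 6, 7], 3) and the same call with fail_at=4 produce the same output list, because the replacement operator rebuilds its buffer from the replayed events instead of from a stored checkpoint. Coordinating savepoints across a network of operators, which the abstract also mentions, is outside the scope of this sketch.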