CEC: Continuous eventual checkpointing for data stream processing operators

Authors:
Zoe Sebepou;Kostas Magoutis
Affiliations:
Institute of Computer Science (ICS), Foundation for Research and Technology - Hellas (FORTH), Heraklion GR-70013, Crete, Greece;Institute of Computer Science (ICS), Foundation for Research and Technology - Hellas (FORTH), Heraklion GR-70013, Crete, Greece
Venue:
DSN '11 Proceedings of the 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks
Year:
2011

Cited by (4):

Real-Time analysis of localization data streams for ambient intelligence environments
AmI'11 Proceedings of the Second international conference on Ambient Intelligence

Integrating scale out and fault tolerance in stream processing using operator state management
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

Rollback-recovery without checkpoints in distributed event processing systems
Proceedings of the 7th ACM international conference on Distributed event-based systems

MigCEP: operator migration for mobility driven distributed complex event processing
Proceedings of the 7th ACM international conference on Distributed event-based systems

Abstract

The checkpoint roll-backward methodology underlies several fault-tolerance solutions for continuous stream processing systems today, implemented using either the memories of replica nodes or a distributed file system. In this scheme, the recovering node loads its most recent checkpoint and requests log replay to reach a consistent pre-failure state. Challenges with this technique include its complexity (it is typically implemented via copy-on-write), the associated overhead (exception handling under state updates), and limits on the frequency of checkpointing. The frequency limit determines how much information must be replayed after a failure, and infrequent checkpoints therefore lead to long recovery times. In this work we introduce continuous eventual checkpointing (CEC), a novel mechanism that provides fault-tolerance guarantees by taking continuous incremental state checkpoints with minimal pausing of operator processing. We achieve this by separating operator state into independent parts and producing frequent, independent partial checkpoints of them. Our results show that our method can achieve low-overhead fault tolerance with adjustable checkpoint intensity, trading off recovery time against performance.
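The idea of splitting operator state into independent parts and checkpointing each part separately can be illustrated with a small sketch. This is not the paper's implementation, which the abstract does not describe in detail. It is a toy model under stated assumptions: the operator keeps a running sum per key, each key's sum is treated as one independent state part, the checkpoint store is an in-memory dict, and the replay log is a list of `(seq, key, value)` tuples. The names `KeyedSumOperator`, `checkpoint_some`, `budget`, and `recover` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class KeyedSumOperator:
    """Toy operator (hypothetical): a running sum per key.

    Assumption: each key's sum is one independent part of the state.
    """
    state: dict = field(default_factory=dict)  # key -> running sum
    dirty: set = field(default_factory=set)    # keys changed since their last checkpoint
    seq: int = 0                               # sequence number of the last processed tuple

    def process(self, key, value):
        """Apply one input tuple and mark its state part as dirty."""
        self.seq += 1
        self.state[key] = self.state.get(key, 0) + value
        self.dirty.add(key)

    def checkpoint_some(self, store, budget):
        """Write at most `budget` partial checkpoints, one per dirty part.

        `budget` stands in for an adjustable checkpoint intensity: a larger
        value costs more work per call but leaves less to replay on recovery.
        Returns the number of partial checkpoints written.
        """
        written = 0
        while self.dirty and written < budget:
            key = self.dirty.pop()
            # Each partial checkpoint records the sequence number it covers.
            store[key] = (self.seq, self.state[key])
            written += 1
        return written


def recover(store, log):
    """Rebuild an operator from partial checkpoints plus log replay.

    `log` is a list of (seq, key, value) tuples in processing order.
    A log entry is replayed only if it is newer than the checkpoint
    of the state part it touches.
    """
    op = KeyedSumOperator()
    covered = {}
    for key, (seq, value) in store.items():
        op.state[key] = value
        covered[key] = seq
    for seq, key, value in log:
        if seq > covered.get(key, 0):
            op.state[key] = op.state.get(key, 0) + value
            op.dirty.add(key)
        op.seq = max(op.seq, seq)
    return op
```

In this sketch, calling `checkpoint_some` with a small `budget` after each tuple never pauses processing for a full-state snapshot, at the price of replaying more log entries in `recover`. That mirrors, in a very simplified form, the trade-off between recovery time and performance that the abstract mentions.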