High availability (HA) is critical for many stream processing applications, such as financial data analysis and disaster response. Existing HA schemes use either active standby or passive standby to guard the system against unexpected failures such as machine crashes. Although previous simulation-based studies report that active standby is superior, the tradeoff between the two HA approaches under practical settings is not well understood. In this paper, we propose a novel sweeping checkpointing method that can reduce the overhead by one order of magnitude. Whereas most previous work addresses single failures, we prove that the sweeping checkpointing method ensures no loss of data even under multiple concurrent failures. We then implement the resulting passive standby variant and compare it against active standby on a real stream processing system. We find that passive standby presents a different tradeoff from active standby: longer recovery time, but 90% less overhead. Each approach therefore has scenarios to which it is better suited.