An empirical study of high availability in stream processing systems

Authors:
Yu Gu; Zhe Zhang; Fan Ye; Hao Yang; Minkyong Kim; Hui Lei; Zhen Liu
Affiliations:
University of Minnesota; North Carolina State University; IBM T.J. Watson Research Center, Hawthorne, NY; IBM T.J. Watson Research Center, Hawthorne, NY; IBM T.J. Watson Research Center, Hawthorne, NY; IBM T.J. Watson Research Center, Hawthorne, NY; Nokia Research China Lab, Beijing, China
Venue:
Proceedings of the 10th ACM/IFIP/USENIX International Conference on Middleware
Year:
2009

References (14)

Hypervisor-based fault tolerance
SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles

A survey of rollback-recovery protocols in message-passing systems
ACM Computing Surveys (CSUR)

Highly available, fault-tolerant, parallel dataflows
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data

High-Availability Algorithms for Distributed Stream Processing
ICDE '05 Proceedings of the 21st International Conference on Data Engineering

Fault-tolerance in the Borealis distributed stream processing system
Proceedings of the 2005 ACM SIGMOD international conference on Management of data

Adaptive Control of Extreme-scale Stream Processing Systems
ICDCS '06 Proceedings of the 26th IEEE International Conference on Distributed Computing Systems

Failure Recovery in Cooperative Data Stream Analysis
ARES '07 Proceedings of the Second International Conference on Availability, Reliability and Security

Towards Autonomic Fault Recovery in System-S
ICAC '07 Proceedings of the Fourth International Conference on Autonomic Computing

Borealis-R: a replication-transparent stream processing system for wide-area monitoring applications
Proceedings of the 2008 ACM SIGMOD international conference on Management of data

Remus: high availability via asynchronous virtual machine replication
NSDI'08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation

SODA: an optimizing scheduler for large-scale stream-based distributed computer systems
Proceedings of the 9th ACM/IFIP/USENIX International Conference on Middleware

CLASP: collaborating, autonomous stream processing systems
Proceedings of the ACM/IFIP/USENIX 2007 International Conference on Middleware

Fast and Reliable Stream Processing over Wide Area Networks
ICDEW '07 Proceedings of the 2007 IEEE 23rd International Conference on Data Engineering Workshop

Efficient and coordinated checkpointing for reliable distributed data stream management
ADBIS'06 Proceedings of the 10th East European conference on Advances in Databases and Information Systems

Cited by (3)

Fault injection-based assessment of partial fault tolerance in stream processing applications
Proceedings of the 5th ACM international conference on Distributed event-based system

Task Scheduling Algorithm for Multicore Processor System for Minimizing Recovery Time in Case of Single Node Fault
CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)

Integrating scale out and fault tolerance in stream processing using operator state management
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

Abstract

High availability (HA) is critical for many stream processing applications, such as financial data analysis and disaster response. Existing HA schemes use either active standby or passive standby to guard the system against unexpected failures such as machine crashes. Although previous simulation-based studies report that active standby is superior, the tradeoffs between different HA approaches under practical settings are not well understood. In this paper, we propose a novel sweeping checkpointing method that can reduce the overhead by one order of magnitude. Whereas most previous work addresses single failures, we prove that the sweeping checkpointing method loses no data even under multiple concurrent failures. We then implement the resulting passive standby variant and compare it against active standby on a real stream processing system. We find that passive standby presents a different tradeoff from active standby: longer recovery time, but 90% less overhead. Each approach therefore has scenarios to which it is better suited.