We present SGuard, a new fault-tolerance technique for distributed stream processing engines (SPEs) running in clusters of commodity servers. SGuard is less disruptive to normal stream processing than previous proposals and leaves more resources available for it. Like several previous schemes, SGuard is based on rollback recovery [18]: it periodically checkpoints the state of stream processing nodes and restarts failed nodes from their most recent checkpoints. In contrast to previous proposals, however, SGuard performs checkpoints asynchronously: operators continue processing streams during a checkpoint, which reduces the disruption caused by checkpointing. Additionally, SGuard saves the checkpointed state into a new type of distributed and replicated file system (DFS), such as GFS [22] or HDFS [9], leaving more memory available for normal stream processing. To manage resource contention caused by simultaneous checkpoints from different SPE nodes, SGuard adds a scheduler to the DFS. This scheduler coordinates large batches of write requests in a manner that reduces individual checkpoint times while maintaining good overall resource utilization. We demonstrate the effectiveness of the approach through measurements of a prototype implementation in the Borealis [2] open-source SPE, using HDFS [9] as the DFS.
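The idea of asynchronous checkpointing with rollback recovery can be illustrated with a minimal Python sketch. It is not SGuard's implementation: the `CountingOperator` class, its methods, and the local JSON checkpoint file are all hypothetical stand-ins, and a plain in-memory copy of the state replaces whatever mechanism the system itself uses to capture state. The local file also stands in for the DFS. The sketch shows only the general pattern: processing pauses briefly to capture a snapshot, the slow write runs in the background, and a restarted operator resumes from the last completed checkpoint.

```python
import copy
import json
import os
import tempfile
import threading


class CountingOperator:
    """Toy stateful operator that counts occurrences of each key (hypothetical)."""

    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path
        self.state = {}
        self._lock = threading.Lock()

    def process(self, key):
        # Normal stream processing: update the operator state.
        with self._lock:
            self.state[key] = self.state.get(key, 0) + 1

    def checkpoint_async(self):
        # Short pause: copy the state while holding the lock.
        with self._lock:
            snapshot = copy.deepcopy(self.state)
        # Slow part: write the snapshot in a background thread so that
        # process() can keep running in the meantime.
        writer = threading.Thread(target=self._write, args=(snapshot,))
        writer.start()
        return writer

    def _write(self, snapshot):
        # Write to a temporary file, then rename, so that a crash during
        # the write never leaves a partial checkpoint at checkpoint_path.
        tmp_path = self.checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(snapshot, f)
        os.replace(tmp_path, self.checkpoint_path)

    @classmethod
    def recover(cls, checkpoint_path):
        # Rollback recovery: restart from the most recent checkpoint.
        op = cls(checkpoint_path)
        with open(checkpoint_path) as f:
            op.state = json.load(f)
        return op


def demo():
    path = os.path.join(tempfile.mkdtemp(), "operator.ckpt")
    op = CountingOperator(path)
    for key in ["a", "b", "a"]:
        op.process(key)
    writer = op.checkpoint_async()
    op.process("a")  # handled while the checkpoint write may still be in flight
    writer.join()
    recovered = CountingOperator.recover(path)
    return op.state, recovered.state
```

Calling `demo()` returns the live state and the recovered state. The recovered state lacks the tuple processed after the snapshot was taken, which is why a rollback-recovery scheme also needs some way to replay input received since the last checkpoint.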