We describe a fault-tolerance technique for implementing operations in a large-scale distributed system that ensures that all components eventually have a consistent view of the system, even in the face of component failures. To achieve this, we break each distributed operation into a series of smaller operations, each local to a single component, and link them together carefully. As a result, the effect of a component failure and restart in the middle of a multi-component operation is limited to that component and its immediate neighbors. This framework is used in System S, a commercial-grade stream processing platform. In that context, we show empirically that our approach is effective and imposes low overhead on distributed inter-component operations.
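The idea of chaining component-local operations can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the protocol used in System S: the `Component` class, its `applied` set and `outbox` map (standing in for durable state), the `handle`/`flush`/`restart` methods, and the `run_demo` scenario are all hypothetical names introduced for illustration. Each component applies its local step idempotently, records the pending hand-off to its next neighbor, and retries that hand-off until it is acknowledged, so a failed component only delays itself and the neighbor waiting on it.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """Hypothetical component; `applied` and `outbox` simulate durable state."""
    name: str
    applied: set = field(default_factory=set)   # operation ids already applied locally
    outbox: dict = field(default_factory=dict)  # op_id -> remaining route still to notify
    up: bool = True

    def handle(self, op_id, route, registry):
        """Apply the local step for op_id, then hand the rest of the route onward."""
        if not self.up:
            return False                    # no ack: the sender keeps the step and retries
        if op_id not in self.applied:       # idempotent: a retried step is applied once
            self.applied.add(op_id)
            if route:
                self.outbox[op_id] = route  # remember the pending hand-off
        self.flush(registry)
        return True                         # acknowledgement to the sender

    def flush(self, registry):
        """Retry pending hand-offs; keep those the neighbor did not acknowledge."""
        for op_id, route in list(self.outbox.items()):
            neighbor = registry[route[0]]
            if neighbor.handle(op_id, route[1:], registry):
                del self.outbox[op_id]

    def restart(self, registry):
        """Come back after a failure and resume from the saved state."""
        self.up = True
        self.flush(registry)


def run_demo():
    """Run one operation across a -> b -> c while b is temporarily down."""
    registry = {n: Component(n) for n in ("a", "b", "c")}
    registry["b"].up = False
    registry["a"].handle("op1", ["b", "c"], registry)  # a applies; hand-off to b stays pending
    pending_during_failure = dict(registry["a"].outbox)
    registry["b"].restart(registry)
    registry["a"].flush(registry)                      # the immediate neighbor retries
    registry["a"].flush(registry)                      # a further retry changes nothing
    return registry, pending_during_failure
```

In this sketch, only `a` (the immediate neighbor of the failed `b`) holds retry state while `b` is down; `c` is untouched until `b` recovers, after which all three components have applied the operation exactly once.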