Distributed middleware reliability and fault tolerance support in System S

Authors:
Rohit Wagle;Henrique Andrade;Kirsten Hildrum;Chitra Venkatramani;Michael Spicer
Affiliations:
IBM T.J. Watson Research Center, Hawthorne, NY, USA;Goldman Sachs, New York, NY, USA;IBM T.J. Watson Research Center, Hawthorne, NY, USA;IBM T.J. Watson Research Center, Hawthorne, NY, USA;IBM T.J. Watson Research Center, Hawthorne, NY, USA
Venue:
Proceedings of the 5th ACM international conference on Distributed event-based systems
Year:
2011

Abstract

We describe a fault-tolerance technique for implementing operations in a large-scale distributed system that ensures all components eventually have a consistent view of the system, even in the face of component failures. To achieve this, we break each distributed operation into a series of smaller, carefully linked operations, each of which is local to a single component. As a result, the effect of a component failing and restarting in the middle of a multi-component operation is limited to that component and its immediate neighbors. This framework is used in System S, a commercial-grade stream processing platform. In that context, we show empirically that our approach is effective and imposes low overhead on distributed inter-component operations.
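
The abstract does not give the protocol itself, so the following Python sketch is only one plausible reading of the idea, not the System S implementation. It assumes that each component keeps a durable record of the operations it has applied and of the ones its downstream neighbor has acknowledged, that local steps are idempotent, and that a failed handoff is retried later by the sending neighbor. The names `Component`, `ComponentDown`, `handle`, `flush`, and `build_chain` are all hypothetical.

```python
class ComponentDown(Exception):
    """Raised when a crashed component is contacted (hypothetical failure signal)."""


class Component:
    """One component in a chain of local steps (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.next = None        # downstream neighbor, if any
        self.up = True
        # The two fields below stand in for durable storage that survives a restart.
        self.applied = {}       # op_id -> payload applied locally
        self.forwarded = set()  # op_ids acknowledged by the downstream neighbor

    def handle(self, op_id, payload):
        """Apply the local step once, then try to pass the operation on."""
        if not self.up:
            raise ComponentDown(self.name)
        self.applied.setdefault(op_id, payload)  # idempotent: a retry is harmless
        self.flush()

    def flush(self):
        """Hand every unacknowledged operation to the downstream neighbor."""
        if not self.up or self.next is None:
            return
        for op_id, payload in list(self.applied.items()):
            if op_id in self.forwarded:
                continue
            try:
                self.next.handle(op_id, payload)
            except ComponentDown:
                return  # neighbor is down; a later flush will retry
            self.forwarded.add(op_id)

    def crash(self):
        self.up = False

    def restart(self):
        # Durable state is kept; the component resumes its pending handoffs.
        self.up = True
        self.flush()


def build_chain(*names):
    """Link components so that each one's downstream neighbor is the next name."""
    components = [Component(n) for n in names]
    for upstream, downstream in zip(components, components[1:]):
        upstream.next = downstream
    return components


a, b, c = build_chain("a", "b", "c")
c.crash()
a.handle("op1", "start job 7")  # a and b apply the step; b cannot reach c yet
c.restart()
b.flush()                       # b, the immediate neighbor, retries its handoff
print(a.applied == b.applied == c.applied)  # prints True
```

In this sketch the crash of `c` is absorbed by `b`, which keeps the handoff pending, while `a` sees its own step acknowledged and is not involved in the recovery. That mirrors the abstract's claim that a failure affects only the failed component and its immediate neighbors, under the assumptions stated above.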