Distributed middleware reliability and fault tolerance support in System S

  • Authors:
  • Rohit Wagle;Henrique Andrade;Kirsten Hildrum;Chitra Venkatramani;Michael Spicer

  • Affiliations:
  • IBM T.J. Watson Research Center, Hawthorne, NY, USA;Goldman Sachs, New York, NY, USA;IBM T.J. Watson Research Center, Hawthorne, NY, USA;IBM T.J. Watson Research Center, Hawthorne, NY, USA;IBM T.J. Watson Research Center, Hawthorne, NY, USA

  • Venue:
  • Proceedings of the 5th ACM international conference on Distributed event-based systems
  • Year:
  • 2011

Abstract

We describe a fault-tolerance technique for implementing operations in a large-scale distributed system that ensures that all components eventually have a consistent view of the system, even in the face of component failures. To achieve this, we break each distributed operation into a series of smaller, carefully linked operations, each local to a single component. Thus, the effect of a component failure and restart in the middle of a multi-component operation is limited to that component and its immediate neighbors. This framework is used in System S, a commercial-grade stream processing platform. In that context, we show empirically that our approach is effective and imposes low overhead on distributed inter-component operations.
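
The idea of chaining local operations can be illustrated with a small sketch. This is not the System S implementation, which the abstract does not detail. It is a minimal in-process model built on assumptions: the names `Component`, `handle`, `call_with_retry`, and `crash_once` are hypothetical, a Python dict stands in for state that survives a restart, and a raised `ConnectionError` stands in for a component failure. Each component applies its local step idempotently, keyed by an operation ID, and then hands the operation to its neighbor, retrying until the neighbor acknowledges.

```python
class Component:
    """One participant in a chained operation (hypothetical model)."""

    def __init__(self, name, next_component=None):
        self.name = name
        self.next_component = next_component
        self.durable = {}        # op_id -> payload; stands in for persistent state
        self.crash_once = False  # demo hook: fail once before applying the step

    def handle(self, op_id, payload):
        """Apply the local step idempotently, then hand off to the neighbor."""
        if self.crash_once:
            self.crash_once = False
            raise ConnectionError(f"{self.name} crashed")
        # Idempotent local step: re-delivery of the same op_id changes nothing.
        self.durable.setdefault(op_id, payload)
        if self.next_component is not None:
            call_with_retry(self.next_component, op_id, payload)


def call_with_retry(target, op_id, payload, max_attempts=5):
    """Retry the hand-off until the target returns; report attempts used."""
    for attempt in range(max_attempts):
        try:
            target.handle(op_id, payload)
            return attempt + 1
        except ConnectionError:
            continue  # assume the target restarts; try the hand-off again
    raise RuntimeError(f"{target.name} unreachable after {max_attempts} attempts")


# Chain A -> B -> C, with C failing once in the middle of the operation.
c = Component("C")
b = Component("B", c)
a = Component("A", b)
c.crash_once = True

attempts = call_with_retry(a, "op-1", {"job": "start"})
print(attempts, a.durable == b.durable == c.durable)
```

In this model only B, the immediate upstream neighbor of C, observes the failure and retries. A completes its hand-off on the first attempt, and all three components end up recording the same operation.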