A Practical Approach for 'Zero' Downtime in an Operational Information System

  • Authors:
  • Ada Gavrilovska;Karsten Schwan;Van Oleson

  • Venue:
  • ICDCS '02: Proceedings of the 22nd International Conference on Distributed Computing Systems (ICDCS'02)
  • Year:
  • 2002

Abstract

An Operational Information System (OIS) supports a real-time view of an organization's information critical to its logistical business operations. A central component of an OIS is an engine that integrates data events captured from distributed, remote sources in order to derive meaningful real-time views of current operations. This Event Derivation Engine (EDE) continuously updates these views and also publishes them to a potentially large number of remote subscribers. The paper first describes a sample OIS and EDE in the context of an airline's operations. It then defines the performance and availability requirements to be met by this system, specifically focusing on the EDE component. One particular requirement for the EDE is that subscribers to its output events should not experience downtime due to EDE failures, crashes, or increased processing loads. Toward this end, we develop and evaluate a practical technique for masking failures and for hiding the costs of recovery from EDE subscribers. This technique utilizes redundant EDEs that coordinate view replicas with a relaxed synchronous fault tolerance protocol. A combination of pre- and post-buffering of replicas is used to attain a solution that offers low response times (i.e., 'zero' downtime) while also preventing system failures in the presence of deterministic faults like 'ill-formed' messages. Parallelism realized via a cluster machine and application-specific techniques for reducing synchronization across replicas are used to scale a 'zero' downtime EDE to support the large number of subscribers it must service.