A Flexible Clustered Approach to High Availability

  • Authors:
  • Gary Hughes-Fenchel

  • Venue:
  • FTCS '97 Proceedings of the 27th International Symposium on Fault-Tolerant Computing (FTCS '97)
  • Year:
  • 1997

Abstract

The Reliable Clustered Computing project created a system that enables applications to improve the reliability of off-the-shelf computers from a typical 99% (about 90 hours of downtime per year) to 99.99% (under one hour of downtime per year) in a cost-effective manner. The chief constraints were the need to achieve high reliability while minimizing cost and maintaining vendor independence. This was realized by creating a vendor-independent clustered configuration composed of two or more computers capable of recovering from hardware or software errors, either by restarting one or more processes on the current machine or by failing over one or more processes to another machine. Only two inexpensive custom hardware components were required for this solution: a WatchDog, to monitor component status, and a PowerDog, to control electrical power to processing elements (and optional peripherals). The bulk of the functionality was provided by software.
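
The downtime figures quoted in the abstract follow from simple arithmetic on the availability percentage. The sketch below is only an illustration of that arithmetic, not anything taken from the paper: the function name annual_downtime_hours and the 365-day year are assumptions made here for the example.

```python
# Illustrative only: converts an availability fraction into expected
# downtime per year. The function name and the 365-day year are
# assumptions for this example, not details from the paper.
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def annual_downtime_hours(availability: float) -> float:
    """Return expected hours of downtime per year for an availability in [0, 1]."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # The two availability levels named in the abstract.
    for availability in (0.99, 0.9999):
        hours = annual_downtime_hours(availability)
        print(f"{availability:.2%} available: {hours:.2f} hours down per year")
```

Under these assumptions, 99% availability works out to 87.6 hours of downtime per year, which the abstract rounds to about 90, and 99.99% works out to roughly 0.88 hours (about 53 minutes), consistent with "under one hour".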