Reliability Analysis of Clustered Computing Systems

  • Authors:
  • Veena B. Mendiratta

  • Venue:
  • ISSRE '98 Proceedings of the Ninth International Symposium on Software Reliability Engineering
  • Year:
  • 1998

Abstract

Clustered computing systems, using commercially available computers networked in a loosely-coupled fashion, can provide high levels of reliability if appropriate levels of error detection and recovery software are implemented in the middleware and application layers. In this paper we present a modeling approach for analyzing the hardware and software reliability of clustered computing systems. The clustered system is modeled as an irreducible Markov chain with working and failed states, and intermediate recovery states. The failure and recovery behavior is characterized in terms of the frequency and duration of fault recoveries and outages for a single processor in the cluster and for the entire clustered system. We apply the model to a telecommunication switching system application that uses the Lucent Technologies Reliable Clustered Computing product. The model results are presented for a range of values of the processor failure rate and the fault recovery coverage factor.
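
To make the modeling idea concrete, the following is a minimal sketch of a three-state irreducible Markov chain for a single processor: up, recovering (a covered fault handled automatically), and down (an uncovered fault needing repair). It is not the model from the paper; the state set, the function names, and all rate values are assumptions chosen for illustration. The coverage factor is taken to be the probability that a failure leads to automatic recovery rather than an outage.

```python
def steady_state(failure_rate, coverage, recovery_rate, repair_rate):
    """Steady-state probabilities (up, recovering, down) of a 3-state chain.

    Assumed transitions (illustrative, not taken from the paper):
      up -> recovering  at rate failure_rate * coverage
      up -> down        at rate failure_rate * (1 - coverage)
      recovering -> up  at rate recovery_rate
      down -> up        at rate repair_rate
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must lie in [0, 1]")
    # Balance equations: flow into each non-up state equals flow out of it.
    rec_ratio = failure_rate * coverage / recovery_rate        # pi_rec / pi_up
    down_ratio = failure_rate * (1.0 - coverage) / repair_rate  # pi_down / pi_up
    pi_up = 1.0 / (1.0 + rec_ratio + down_ratio)
    return pi_up, rec_ratio * pi_up, down_ratio * pi_up


def event_frequencies(failure_rate, coverage, recovery_rate, repair_rate):
    """Long-run rates of (fault recoveries, outages) per unit time."""
    pi_up, _, _ = steady_state(failure_rate, coverage, recovery_rate, repair_rate)
    recoveries = pi_up * failure_rate * coverage
    outages = pi_up * failure_rate * (1.0 - coverage)
    return recoveries, outages


if __name__ == "__main__":
    # Hypothetical rates per hour: one failure per 10,000 h,
    # 1-minute automatic recovery, 2-hour manual repair.
    lam, mu_rec, mu_rep = 1e-4, 60.0, 0.5
    for c in (0.90, 0.95, 0.99):
        pi_up, pi_rec, pi_down = steady_state(lam, c, mu_rec, mu_rep)
        _, outages = event_frequencies(lam, c, mu_rec, mu_rep)
        print(f"coverage={c:.2f}  P(down)={pi_down:.3e}  outages/h={outages:.3e}")
```

Sweeping the failure rate and the coverage factor in this way mirrors, in a much reduced form, the kind of parameter study the abstract describes. A model of the whole cluster would need additional states for multiple processors and for failover between them.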