Reliability Analysis of Clustered Computing Systems

Authors:
Veena B. Mendiratta
Affiliations:
-
Venue:
ISSRE '98 Proceedings of the Ninth International Symposium on Software Reliability Engineering
Year:
1998

Abstract

Clustered computing systems, using commercially available computers networked in a loosely coupled fashion, can provide high levels of reliability if appropriate levels of error detection and recovery software are implemented in the middleware and application layers. In this paper we present a modeling approach for analyzing the hardware and software reliability of clustered computing systems. The clustered system is modeled as an irreducible Markov chain with working and failed states, and intermediate recovery states. The failure and recovery behavior is characterized in terms of the frequency and duration of fault recoveries and outages for a single processor in the cluster and for the entire clustered system. We apply the model to a telecommunication switching system application that uses the Lucent Technologies Reliable Clustered Computing product. The model results are presented for a range of values of the processor failure rate and the fault recovery coverage factor.
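
To make the kind of model described above concrete, the following is a minimal sketch of a three-state continuous-time Markov chain for a single processor. It is not the paper's model: the state set (up, recovering, down), the function names, and all rate values are assumptions chosen for illustration. A fault occurs at rate `lam`; with probability `c` (the coverage factor) it is detected and handled by a short automatic recovery completing at rate `mu_r`, and otherwise it causes an outage repaired at rate `mu_d`.

```python
def steady_state(lam, c, mu_r, mu_d):
    """Steady-state probabilities of a hypothetical up/recovering/down chain.

    lam  : processor failure rate (assumed unit: per hour)
    c    : fault recovery coverage factor, between 0 and 1
    mu_r : rate of completing an automatic recovery (illustrative)
    mu_d : rate of repairing an uncovered failure (illustrative)
    """
    # Balance equations for the two non-up states:
    #   pi_recovering * mu_r = pi_up * c * lam
    #   pi_down       * mu_d = pi_up * (1 - c) * lam
    rec_ratio = c * lam / mu_r
    down_ratio = (1.0 - c) * lam / mu_d
    # Normalize so the three probabilities sum to 1.
    pi_up = 1.0 / (1.0 + rec_ratio + down_ratio)
    return {
        "up": pi_up,
        "recovering": pi_up * rec_ratio,
        "down": pi_up * down_ratio,
    }


def event_frequencies(lam, c, mu_r, mu_d):
    """Long-run entries per unit time into the recovering and down states."""
    pi_up = steady_state(lam, c, mu_r, mu_d)["up"]
    return {
        "recoveries": pi_up * c * lam,
        "outages": pi_up * (1.0 - c) * lam,
    }


if __name__ == "__main__":
    # Sweep the coverage factor at a fixed, made-up failure rate.
    for c in (0.90, 0.95, 0.99):
        pi = steady_state(lam=1e-4, c=c, mu_r=60.0, mu_d=0.5)
        freq = event_frequencies(lam=1e-4, c=c, mu_r=60.0, mu_d=0.5)
        print(f"c={c:.2f}  P(down)={pi['down']:.3e}  outages/hour={freq['outages']:.3e}")
```

The mean duration of a visit to a state is the reciprocal of its exit rate (`1 / mu_r` for a recovery, `1 / mu_d` for an outage), so frequency and duration together give the fraction of time spent in each state. A cluster-level model would need additional states for multiple processors and failover, which this sketch does not attempt.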