Clustered computing systems, using commercially available computers networked in a loosely-coupled fashion, can provide high levels of reliability if appropriate levels of error detection and recovery software are implemented in the middleware and application layers. In this paper we present a modeling approach for analyzing the hardware and software reliability of clustered computing systems. The clustered system is modeled as an irreducible Markov chain with working and failed states, and intermediate recovery states. The failure and recovery behavior is characterized in terms of the frequency and duration of fault recoveries and outages for a single processor in the cluster and for the entire clustered system. We apply the model to a telecommunication switching system application that uses the Lucent Technologies Reliable Clustered Computing product. The model results are presented for a range of values of the processor failure rate and the fault recovery coverage factor.
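The kind of model described above can be illustrated with a minimal sketch. It is not the paper's actual model: it assumes a hypothetical three-state continuous-time Markov chain for a single processor (working, automatic recovery, outage), in which a fraction `coverage` of failures is caught and recovered automatically and the rest lead to an outage that needs repair. The function names, the state layout, and all rate values are illustrative assumptions, with rates expressed per hour.

```python
def steady_state(Q):
    """Solve pi * Q = 0 with sum(pi) = 1 for an irreducible CTMC generator Q."""
    n = len(Q)
    # Work with the transposed system Q^T x = 0 and replace the last
    # equation by the normalization constraint sum(x) = 1.
    A = [[Q[j][i] for j in range(n)] for i in range(n)]
    A[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for k in range(col, n):
                A[r][k] -= factor * A[col][k]
            b[r] -= factor * b[col]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x


def cluster_steady_state(failure_rate, coverage, recovery_rate, repair_rate):
    """Hypothetical 3-state model: 0 = working, 1 = recovering, 2 = outage.

    Returns (pi, outage_frequency), where pi is the steady-state
    distribution and outage_frequency is the rate of entering the
    outage state (uncovered failures per hour).
    """
    lam, c = failure_rate, coverage
    Q = [
        [-lam, c * lam, (1.0 - c) * lam],      # covered vs. uncovered failure
        [recovery_rate, -recovery_rate, 0.0],  # automatic recovery completes
        [repair_rate, 0.0, -repair_rate],      # outage is repaired
    ]
    pi = steady_state(Q)
    outage_frequency = pi[0] * (1.0 - c) * lam
    return pi, outage_frequency


if __name__ == "__main__":
    # Sweep the coverage factor for one assumed failure rate.
    for c in (0.90, 0.95, 0.99):
        pi, freq = cluster_steady_state(1e-4, c, 360.0, 0.5)
        print(f"coverage={c:.2f}  P(working)={pi[0]:.8f}  outages/hour={freq:.3e}")
```

Sweeping `failure_rate` and `coverage` in this way mirrors the parameter ranges the abstract mentions. A model of the entire cluster would need additional states for multiple processors, which this sketch does not attempt.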