A high-performance computing (HPC) system is composed of a large number of components and is therefore prone to failure. To maximize HPC system utilization, one should understand the failure behavior and reliability of the system. Studies in the literature show that the time to failure of a node is best described by a Weibull distribution. In this study, we consider, without loss of generality, the Weibull as the distribution of time to failure and develop a reliability model for a system of k nodes in which nodes can fail simultaneously. From this model, we derive expressions for the probability of system failure at any time t, for the failure rate, and for the mean time to failure. We also validate the model using failure data from the Blue Gene/L logs obtained from Lawrence Livermore National Laboratory. Results show that if failures of the components (nodes) in the system possess a degree of dependency, the system becomes less reliable: the failure rate increases and the mean time to failure decreases. Increasing the number of nodes likewise decreases the reliability of the system.
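To illustrate the baseline case the abstract builds on, the sketch below computes the reliability, failure rate, and mean time to failure of a series system of k independent, identical Weibull nodes. This is a hedged illustration only: it assumes independence, whereas the paper's contribution is a model that allows dependent (simultaneous) node failures, so the function names and parameters here are hypothetical and not taken from the paper itself. For k i.i.d. Weibull(β, η) nodes in series, the system reliability is R(t) = exp(-k (t/η)^β), which is again Weibull with scale η·k^(-1/β).

```python
import math

def system_reliability(t, k, shape, scale):
    """Reliability at time t of a series system of k independent,
    identical Weibull nodes: R(t) = exp(-k * (t/scale)**shape)."""
    return math.exp(-k * (t / scale) ** shape)

def system_failure_rate(t, k, shape, scale):
    """System hazard rate: h(t) = k * (shape/scale) * (t/scale)**(shape-1).
    Valid for t > 0; for shape < 1 the rate is decreasing (infant mortality)."""
    return k * (shape / scale) * (t / scale) ** (shape - 1)

def system_mttf(k, shape, scale):
    """Mean time to failure of the series system:
    MTTF = scale * k**(-1/shape) * Gamma(1 + 1/shape)."""
    return scale * k ** (-1.0 / shape) * math.gamma(1.0 + 1.0 / shape)

# Under these independence assumptions, adding nodes shrinks the MTTF,
# consistent with the abstract's observation that more nodes means lower
# system reliability. Dependent failures would reduce it further still.
```

For example, with shape 0.7 and scale 1000 hours (illustrative values, not fitted to the Blue Gene/L data), `system_mttf(64, 0.7, 1000.0)` is well below `system_mttf(8, 0.7, 1000.0)`, showing the node-count effect even before any failure dependency is introduced.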