Reliability model of a system of k nodes with simultaneous failures for high-performance computing applications

  • Authors:
  • Thanadech Thanakornworakij;Raja Nassar;Chokchai Box Leangsuksun;Mihaela Paun

  • Affiliations:
  • College of Engineering and Science, Louisiana Tech University, Ruston, LA, USA;College of Engineering and Science, Louisiana Tech University, Ruston, LA, USA;College of Engineering and Science, Louisiana Tech University, Ruston, LA, USA;College of Engineering and Science, Louisiana Tech University, Ruston, LA, USA, National Institute of Research and Development for Biological Sciences, Bucharest, Romania

  • Venue:
  • International Journal of High Performance Computing Applications
  • Year:
  • 2013

Abstract

A high-performance computing (HPC) system, which is composed of a large number of components, is prone to failure. To maximize HPC system utilization, one should understand the failure behavior and the reliability of the system. Studies in the literature show that the time to failure of a node is best described by a Weibull distribution. In this study, we take the Weibull, without loss of generality, as the distribution of time to failure and develop a reliability model for a system of k nodes in which nodes can fail simultaneously. From this model, we derive expressions for the probability of failure of the system at any time t, for the failure rate, and for the mean time to failure. We also validate the model using failure data from the Blue Gene/L logs obtained from the Lawrence Livermore National Laboratory. The results show that when failures of the components (nodes) in the system are dependent to some degree, the system becomes less reliable: the failure rate increases and the mean time to failure decreases. Increasing the number of nodes also decreases the reliability of the system.
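
The three quantities named in the abstract (probability of failure, failure rate, and mean time to failure) can be illustrated with a minimal Python sketch of the textbook baseline case. It is not the paper's model: it assumes k identical nodes that fail independently, with the system failing as soon as any node fails, whereas the paper's model allows dependent, simultaneous failures, and its expressions are not given in the abstract. The function names and parameter values are hypothetical and chosen only for illustration.

```python
import math


def node_reliability(t, shape, scale):
    """Weibull survival function of one node: R(t) = exp(-(t/scale)^shape)."""
    return math.exp(-((t / scale) ** shape))


def system_reliability(t, k, shape, scale):
    """Assumed baseline: k independent, identical nodes in series, R_sys(t) = R(t)^k."""
    return node_reliability(t, shape, scale) ** k


def system_failure_probability(t, k, shape, scale):
    """Probability that the baseline system has failed by time t."""
    return 1.0 - system_reliability(t, k, shape, scale)


def system_failure_rate(t, k, shape, scale):
    """Hazard of the baseline system for t > 0: k * (shape/scale) * (t/scale)^(shape-1)."""
    return k * (shape / scale) * (t / scale) ** (shape - 1)


def system_mttf(k, shape, scale):
    """Mean time to failure of the baseline system.

    The minimum of k independent Weibull(shape, scale) lifetimes is again
    Weibull with the same shape and scale * k^(-1/shape), so
    MTTF = scale * k^(-1/shape) * Gamma(1 + 1/shape).
    """
    return scale * k ** (-1.0 / shape) * math.gamma(1.0 + 1.0 / shape)


if __name__ == "__main__":
    shape, scale = 0.7, 1000.0  # illustrative values, not taken from the paper
    for k in (1, 16, 256):
        print(k, system_reliability(10.0, k, shape, scale), system_mttf(k, shape, scale))
```

Even in this independent baseline, reliability and mean time to failure fall as k grows, which matches the abstract's last observation. The abstract reports that dependency among node failures lowers reliability further.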