Dependability Measurement and Modeling of a Multicomputer System

  • Authors:
  • D. Tang;R. K. Iyer

  • Affiliations:
  • -;-

  • Venue:
  • IEEE Transactions on Computers
  • Year:
  • 1993

Quantified Score

Hi-index 14.98

Visualization

Abstract

A measurement-based analysis of error data collected from a DEC VAXcluster multicomputer system is presented. Basic system dependability characteristics such as error/failure distributions and hazard rate are obtained for both the individual machine and the entire VAXcluster. Markov reward models are developed to analyze error/failure behavior and to evaluate performance loss due to errors/failures. Correlation analysis is then performed to quantify relationships of error/failures across machines and across time. It is found that shared resources constitute a major reliability bottleneck. It is shown that for measured system, the homogeneous Markov model, which assumes constant failure rates, overestimates the transient reward rate for the short-term operation, and underestimates it for the long-term operation. Correlation analysis shows that errors are highly correlated across machines and across time. The failure correlation coefficient is low. However, its effect on system unavailability is significant.