Failure Data Analysis of a LAN of Windows NT Based Computers

  • Authors:
  • M. Kalyanakrishnam;Z. Kalbarczyk;R. Iyer

  • Venue:
  • SRDS '99 Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems
  • Year:
  • 1999

Abstract

This paper presents the results of a failure data analysis of a LAN of Windows NT machines. Data for the study were obtained from event logs collected over a six-month period from the mail routing network of a commercial organization. The study focuses on characterizing the causes of machine reboots. The key observations are: (1) most of the problems that lead to reboots are software related; (2) rebooting the machine does not always solve the problem (in about 60% of the reboots, the rebooted machine reported problems within an hour or two of the reboot); (3) there are indications of propagated or correlated failures; and (4) although the average availability exceeds 99%, machine downtime lasts two hours on average. Since the machines are dedicated mail servers, bringing down one or more of them can potentially disrupt the storage, forwarding, reception, and delivery of mail. This suggests that average availability is not a good measure for characterizing this type of network service.
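
To make observation (4) concrete, the short sketch below works through the arithmetic under the standard steady-state definition, availability = MTTF / (MTTF + MTTR). This formula, the function names, and the round figures of exactly 99% availability and a 2-hour mean repair time are illustrative assumptions; the abstract does not state how the paper itself computes availability, and the paper's measured values may differ.

```python
# Illustrative arithmetic only: the steady-state availability formula and the
# round numbers below are assumptions, not values taken from the paper.

def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the machine is up."""
    return mttf_hours / (mttf_hours + mttr_hours)

def mttf_for_availability(target: float, mttr_hours: float) -> float:
    """Solve A = MTTF / (MTTF + MTTR) for MTTF, given A and MTTR."""
    return target * mttr_hours / (1.0 - target)

mttr = 2.0                                 # assumed mean downtime per outage (hours)
mttf = mttf_for_availability(0.99, mttr)   # uptime between outages needed for 99%

# Over a six-month window (taken here as 183 days), how many outages fit?
window_hours = 183 * 24
outages = window_hours / (mttf + mttr)

print(f"MTTF for 99% availability with 2 h repairs: {mttf:.1f} h")
print(f"Approximate outages per machine in six months: {outages:.1f}")
```

Under these assumptions, a machine can report 99% availability while still failing roughly every eight days and staying down for two hours each time. For a dedicated mail server, each such outage is long enough to delay mail, which is the sense in which the average availability figure understates the impact on the service.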