Failure data-driven selective node-level duplication to improve MTTF in high performance computing systems

  • Authors:
  • Nithin Nakka;Alok Choudhary

  • Affiliations:
  • Department of Electrical Engineering and Computer Science, Northwestern University, 2145 Sheridan Rd, Tech Inst. Bldg., EECS Dept., Evanston, IL;Department of Electrical Engineering and Computer Science, Northwestern University, 2145 Sheridan Rd, Tech Inst. Bldg., EECS Dept., Evanston, IL

  • Venue:
  • HPCS'09 Proceedings of the 23rd international conference on High Performance Computing Systems and Applications
  • Year:
  • 2009

Abstract

This paper presents our analysis of the failure behavior of large-scale systems using the failure logs collected by Los Alamos National Laboratory on 22 of its computing clusters. We note that not all nodes in these systems show similar failure behavior. Our objective, therefore, was to arrive at an ordering in which nodes are incrementally (one by one) selected for duplication, so that a target MTTF for the system is achieved after duplicating the fewest nodes. We developed a model for the fault coverage provided by duplicating each node and ordered the nodes according to that coverage. Compared with the traditional approach of randomly choosing nodes for duplication, our model-driven approach provides improvements ranging from 82% to 1700%, depending on the targeted improvement in MTTF and the failure distribution of the nodes in the system.
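
The selection idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's coverage model, which the abstract does not specify. It assumes instead that each node has a constant failure rate estimated from logs, that the system fails when any non-duplicated node fails (so system MTTF is the reciprocal of the summed rates), and that duplicating a node masks all of its failures. The function names (`system_mttf`, `coverage_order`, `select_nodes`) and the failure rates are hypothetical and chosen only for illustration.

```python
import random


def system_mttf(rates, duplicated):
    """MTTF of the system under the assumptions stated above.

    rates: per-node failure rates (failures per hour), hypothetical values.
    duplicated: set of node indices whose failures are assumed fully masked.
    """
    residual = sum(r for i, r in enumerate(rates) if i not in duplicated)
    return float("inf") if residual == 0 else 1.0 / residual


def coverage_order(rates):
    """Order nodes by decreasing failure rate, a stand-in for 'coverage'."""
    return sorted(range(len(rates)), key=lambda i: rates[i], reverse=True)


def select_nodes(rates, target_mttf, order):
    """Duplicate nodes one by one in the given order until the target is met."""
    duplicated = set()
    for node in order:
        if system_mttf(rates, duplicated) >= target_mttf:
            break
        duplicated.add(node)
    return duplicated


# Hypothetical failure rates: two nodes fail far more often than the rest.
rates = [0.020, 0.001, 0.002, 0.010, 0.001, 0.001]
target = 2.0 * system_mttf(rates, set())  # aim to double the baseline MTTF

# Rate-ordered selection versus a random ordering (fixed seed for repeatability).
ordered = select_nodes(rates, target, coverage_order(rates))
random_order = list(range(len(rates)))
random.Random(0).shuffle(random_order)
randomly_chosen = select_nodes(rates, target, random_order)

print("rate-ordered selection duplicates:", sorted(ordered))
print("random selection duplicates:", sorted(randomly_chosen))
```

With these made-up rates, duplicating the single most failure-prone node is enough to double the baseline MTTF, whereas a random ordering has to keep adding nodes until it happens to include that node. Real logs would call for a more careful coverage estimate than a raw failure rate.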