The Reliability Wall for Exascale Supercomputing

  • Authors:
  • Xuejun Yang;Zhiyuan Wang;Jingling Xue;Yun Zhou

  • Affiliations:
  • National University of Defense Technology, ChangSha;National University of Defense Technology, ChangSha;University of New South Wales, Sydney;National University of Defense Technology, ChangSha

  • Venue:
  • IEEE Transactions on Computers
  • Year:
  • 2012

Quantified Score

Hi-index 14.98

Abstract

Reliability is a key challenge that must be understood to turn the vision of exascale supercomputing into reality. Inevitably, large-scale supercomputing systems, especially those at the peta/exascale levels, must tolerate failures by incorporating fault-tolerance mechanisms to improve their reliability and availability. As the benefits of fault-tolerance mechanisms rarely come without associated time and/or capital costs, reliability will limit the scalability of parallel applications. This paper introduces for the first time the concept of the "Reliability Wall" to highlight the significance of achieving scalable performance in peta/exascale supercomputing with fault tolerance. We quantify the effects of reliability on scalability by proposing a reliability speedup, defining the reliability wall quantitatively, giving an existence theorem for the reliability wall, and categorizing a given system according to the time overhead incurred by fault tolerance. We also extend these results into a general reliability speedup/wall framework by considering not only speedup but also costup. We analyze and extrapolate the existence of the reliability wall using two representative supercomputers, Intrepid and ASCI White, both of which employ checkpointing for fault tolerance, and we study the general reliability wall using Intrepid. These case studies provide insights into how to mitigate reliability-wall effects in system design and through hardware/software optimizations in peta/exascale supercomputing.
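
As a rough illustration of why checkpointing overhead can cap scalability, the sketch below uses a toy model that is not taken from the paper and is not its reliability-speedup definition. It assumes that the system mean time between failures shrinks as the per-node MTBF divided by the node count, that the checkpoint cost stays constant as the system grows, and that checkpoints are taken at Young's first-order optimal interval, for which the fraction of time lost is approximately sqrt(2C/M). The function names and the numbers in the usage note are hypothetical.

```python
import math


def checkpoint_waste_fraction(node_mtbf_hours, nodes, checkpoint_hours):
    """Approximate fraction of wall-clock time lost to checkpointing and rework.

    Assumptions (not from the paper): system MTBF M = node MTBF / nodes,
    and checkpoints are taken at Young's interval sqrt(2 * C * M), which
    gives a first-order waste fraction of sqrt(2 * C / M).
    """
    system_mtbf = node_mtbf_hours / nodes
    waste = math.sqrt(2.0 * checkpoint_hours / system_mtbf)
    # The approximation breaks down for large waste; clamp to 100%.
    return min(waste, 1.0)


def effective_speedup(nodes, node_mtbf_hours, checkpoint_hours):
    """Ideal linear speedup scaled by the fraction of useful time."""
    useful = 1.0 - checkpoint_waste_fraction(node_mtbf_hours, nodes, checkpoint_hours)
    return nodes * useful


def peak_node_count(node_mtbf_hours, checkpoint_hours):
    """Node count that maximizes effective_speedup in this toy model.

    Setting the derivative of N * (1 - sqrt(2 * C * N / m)) to zero
    gives N = 2 * m / (9 * C).
    """
    return 2.0 * node_mtbf_hours / (9.0 * checkpoint_hours)
```

With a hypothetical per-node MTBF of 50,000 hours and a 0.1-hour checkpoint, `peak_node_count(50_000, 0.1)` is roughly 111,000 nodes. Beyond that point, adding nodes lowers the effective speedup in this model, which is the kind of bound the abstract refers to as a wall.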