The Reliability Wall for Exascale Supercomputing

Authors:
Xuejun Yang; Zhiyuan Wang; Jingling Xue; Yun Zhou
Affiliations:
National University of Defense Technology, Changsha; National University of Defense Technology, Changsha; University of New South Wales, Sydney; National University of Defense Technology, Changsha
Venue:
IEEE Transactions on Computers
Year:
2012

Abstract

Reliability is a key challenge that must be understood to turn the vision of exascale supercomputing into reality. Large-scale supercomputing systems, especially those at the peta/exascale levels, must inevitably tolerate failures by incorporating fault-tolerance mechanisms that improve their reliability and availability. Because the benefits of fault-tolerance mechanisms rarely come without associated time and/or capital costs, reliability will limit the scalability of parallel applications. This paper introduces for the first time the concept of the "Reliability Wall" to highlight the significance of achieving scalable performance in peta/exascale supercomputing with fault tolerance. We quantify the effects of reliability on scalability by proposing a reliability speedup, defining the reliability wall quantitatively, giving an existence theorem for the reliability wall, and categorizing a given system according to the time overhead incurred by fault tolerance. We also generalize these results into a general reliability speedup/wall framework by considering not only speedup but also costup. We analyze and extrapolate the existence of the reliability wall using two representative supercomputers, Intrepid and ASCI White, both of which employ checkpointing for fault tolerance, and we also study the general reliability wall using Intrepid. These case studies provide insights into how to mitigate reliability-wall effects in system design and through hardware/software optimizations in peta/exascale supercomputing.
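
The claim that fault-tolerance overhead limits scalability can be illustrated with a rough sketch. The model below is not the paper's reliability speedup; it is a simplified textbook model built on several assumptions that do not come from the abstract: independent node failures (so system MTBF is node MTBF divided by the node count), a fixed checkpoint cost, Young's first-order approximation for the checkpoint interval, zero restart time, and a perfectly parallel application. The function names and parameter values are hypothetical.

```python
import math


def young_interval(checkpoint_cost, system_mtbf):
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost * system_mtbf)


def effective_speedup(nodes, node_mtbf, checkpoint_cost):
    """Speedup of an ideally parallel job after checkpointing overhead.

    Assumptions (illustrative only, not taken from the paper):
      - node failures are independent, so system MTBF = node_mtbf / nodes
      - checkpoint_cost does not grow with the node count
      - restart time is ignored
      - the first-order waste model only holds while the interval is
        well below the system MTBF
    All times share one unit (hours here).
    """
    system_mtbf = node_mtbf / nodes
    tau = young_interval(checkpoint_cost, system_mtbf)
    # Time lost per unit of useful work: writing checkpoints, plus the
    # expected rework after a failure (half an interval on average).
    waste = checkpoint_cost / tau + tau / (2.0 * system_mtbf)
    return nodes * max(0.0, 1.0 - waste)


if __name__ == "__main__":
    # Hypothetical values: 50,000-hour node MTBF, 6-minute checkpoints.
    for n in (1_000, 10_000, 100_000, 200_000, 250_000):
        print(n, round(effective_speedup(n, 50_000.0, 0.1), 1))
```

Under these assumptions the effective speedup first grows with the node count, then peaks and falls, because the system MTBF shrinks while the checkpoint cost stays fixed. That qualitative ceiling is the kind of effect the abstract refers to as a reliability wall; the paper's own definitions and existence theorem should be consulted for the precise formulation.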