Teraflops Supercomputer: Architecture and Validation of the Fault Tolerance Mechanisms

  • Authors:
  • Cristian Constantinescu

  • Affiliations:
  • Intel Corporation, Hillsboro, OR

  • Venue:
  • IEEE Transactions on Computers
  • Year:
  • 2000

Quantified Score

Hi-index 14.98

Visualization

Abstract

Intel Corporation developed the Teraflops supercomputer for the US Department of Energy (DOE) as part of the Accelerated Strategic Computing Initiative (ASCI). This is the most powerful computing machine available today, performing over two trillion floating point operations per second with the aid of more than 9,000 Intel processors. The Teraflops machine employs complex hardware and software fault/error handling mechanisms for complying with DOE's reliability requirements. This paper gives a brief description of the system architecture and presents the validation of the fault tolerance mechanisms. Physical fault injection at the IC pin level was used for validation purposes. An original approach was developed for assessing signal sensitivity to transient faults and the effectiveness of the fault/error handling mechanisms. Dependency between fault/error detection coverage and fault duration was also determined. Fault injection experiments unveiled several malfunctions at the hardware, firmware, and software levels. The supercomputer performed according to the DOE requirements after corrective actions were implemented. The fault injection approach presented in this paper can be used for validation of any fault-tolerant or highly available computing system.