Verification for fault tolerance of the IBM System z microprocessor

  • Authors:
  • Brian W. Thompto;Bodo Hoppe

  • Affiliations:
  • IBM Systems & Technology Group, Austin, Texas;IBM Germany Research & Development GmbH, Boeblingen, Germany

  • Venue:
  • Proceedings of the 47th Design Automation Conference
  • Year:
  • 2010

Abstract

IBM System z processors are known for their industry-leading Reliability, Availability and Serviceability (RAS). The hardware is designed for high resilience against errors and for recovery from errors while maintaining a valid architectural state. This paper describes the thorough verification effort required to prove, prior to design tape-out, that the fault tolerance of the IBM System z processor core meets these high expectations. It proposes a multifaceted verification methodology that covers correct error detection, isolation and recovery. Soft errors significantly enlarge the state space of a design, which challenges the functional verification environment: it must tolerate the resulting failures while still checking architectural compliance. Several fault injection mechanisms are discussed, with a special focus on the novel methodology of Comprehensive Fault Injection (CFI), which is used to validate and improve the dependability characteristics of the processor core and thereby its Soft Error Resilience (SER). Feedback of results, together with measurements of efficiency and functional coverage, is an integral part of the overall methodology and allows smart use of the available compute resources.
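
As a rough illustration of what a fault injection campaign checks, the following minimal sketch flips a single bit in a toy register and classifies the outcome as detected and recovered, or as silent corruption. It is not the paper's CFI methodology or tooling. The names `ToyRegister` and `run_campaign`, the 8-bit width, the single parity bit and the checkpoint-restore recovery are all assumptions made for this example.

```python
import random
from collections import Counter


def parity(value: int) -> int:
    """Return the even-parity bit of a non-negative integer."""
    return bin(value).count("1") & 1


class ToyRegister:
    """Hypothetical 8-bit latch bank, optionally protected by one parity bit."""

    WIDTH = 8

    def __init__(self, value: int, protected: bool):
        self.value = value & 0xFF
        self.protected = protected
        self.check = parity(self.value)   # parity stored at write time
        self.checkpoint = self.value      # last known-good state (assumed)

    def inject(self, bit: int) -> None:
        """Flip one stored bit, modeling a single soft error."""
        self.value ^= 1 << bit

    def error_detected(self) -> bool:
        """Detection: stored parity no longer matches the data."""
        return self.protected and parity(self.value) != self.check

    def recover(self) -> None:
        """Recovery: restore the checkpointed state."""
        self.value = self.checkpoint


def run_campaign(n_trials: int, protected: bool, seed: int = 0) -> Counter:
    """Inject one random bit flip per trial and tally the outcomes."""
    rng = random.Random(seed)
    outcomes = Counter()
    for _ in range(n_trials):
        golden = rng.randrange(256)       # expected architectural value
        reg = ToyRegister(golden, protected)
        reg.inject(rng.randrange(ToyRegister.WIDTH))
        if reg.error_detected():
            reg.recover()
            ok = reg.value == golden
            outcomes["recovered" if ok else "recovery_failed"] += 1
        else:
            corrupt = reg.value != golden
            outcomes["silent_corruption" if corrupt else "masked"] += 1
    return outcomes


if __name__ == "__main__":
    print("protected:  ", dict(run_campaign(1000, protected=True)))
    print("unprotected:", dict(run_campaign(1000, protected=False)))
```

In this toy model every single-bit flip in the protected register is detected and recovered, while every flip in the unprotected register ends as silent corruption. A real campaign would inject into a simulated design and compare against an architectural reference model rather than a stored golden value.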