IBM System z processors are known for their industry-leading reliability, availability, and serviceability (RAS). The hardware is designed to be highly resilient against errors and to recover from them while maintaining a valid architectural state. This paper describes the thorough verification effort required to demonstrate, prior to design tape-out, that the fault tolerance of the IBM System z processor core meets these high expectations. It proposes a multifaceted verification methodology that covers the various aspects of correct error detection, isolation, and recovery. Soft errors significantly enlarge the state space of a design, which poses a major challenge for the functional verification environment: it must tolerate the resulting failures while still checking architectural compliance. Several fault injection mechanisms are discussed, with a special focus on the novel methodology of Comprehensive Fault Injection (CFI), which is used to validate and improve the dependability characteristics of the processor core and thereby its Soft Error Resilience (SER). Feedback from the results, together with measurements of efficiency and functional coverage, is an integral part of the overall methodology and allows the available compute resources to be used efficiently.