IBM S/390 parallel enterprise server G5 fault tolerance: a historical perspective

  • Authors:
  • L. Spainhower; T. A. Gregg

  • Affiliations:
  • IBM Server Development, Poughkeepsie, New York; IBM System, Poughkeepsie, New York

  • Venue:
  • IBM Journal of Research and Development
  • Year:
  • 1999

Abstract

Fault tolerance in IBM S/390® systems during the 1980s and 1990s had three distinct phases, each characterized by a different uptime improvement rate. Early TCM-technology mainframes delivered excellent data integrity, instantaneous error detection, and positive fault isolation, but had limited on-line repair. Later TCM mainframes introduced capabilities for providing a high degree of transparent recovery, failure masking, and on-line repair. New challenges accompanied the introduction of CMOS technology. A significant reduction in parts count greatly improved intrinsic failure rates, but dense packaging disallowed on-line CPU repair. In addition, characteristics of the microprocessor technology posed difficulties for traditional in-line error checking. As a result, system fault-tolerant design, particularly in CPUs and memory, underwent another evolution from G1 to G5. G5 implements an innovative design for a high-performance, fault-tolerant single-chip microprocessor. Dynamic CPU sparing delivers a transparent concurrent repair mechanism. A new internal channel provides a high-performance, highly available Parallel Sysplex® in a single mainframe. G5 is both the culmination of decades of innovation and careful implementation, and the highest achievement of S/390 fault-tolerant design.