IBM S/390 parallel enterprise server G5 fault tolerance: a historical perspective

Authors:
L. Spainhower; T. A. Gregg
Affiliations:
IBM Server Development, Poughkeepsie, New York; IBM System, Poughkeepsie, New York
Venue:
IBM Journal of Research and Development
Year:
1999

Abstract

Fault tolerance in IBM S/390® systems during the 1980s and 1990s had three distinct phases, each characterized by a different uptime improvement rate. Early TCM-technology mainframes delivered excellent data integrity, instantaneous error detection, and positive fault isolation, but had limited on-line repair. Later TCM mainframes introduced capabilities for providing a high degree of transparent recovery, failure masking, and on-line repair. New challenges accompanied the introduction of CMOS technology. A significant reduction in parts count greatly improved intrinsic failure rates, but dense packaging disallowed on-line CPU repair. In addition, characteristics of the microprocessor technology posed difficulties for traditional in-line error checking. As a result, system fault-tolerant design, particularly in CPUs and memory, underwent another evolution from G1 to G5. G5 implements an innovative design for a high-performance, fault-tolerant single-chip microprocessor. Dynamic CPU sparing delivers a transparent concurrent repair mechanism. A new internal channel provides a high-performance, highly available Parallel Sysplex® in a single mainframe. G5 is both the culmination of decades of innovation and careful implementation, and the highest achievement of S/390 fault-tolerant design.