Fault tolerance in IBM S/390® systems during the 1980s and 1990s had three distinct phases, each characterized by a different rate of uptime improvement. Early TCM-technology mainframes delivered excellent data integrity, instantaneous error detection, and positive fault isolation, but had limited on-line repair. Later TCM mainframes introduced capabilities for providing a high degree of transparent recovery, failure masking, and on-line repair. New challenges accompanied the introduction of CMOS technology. A significant reduction in parts count greatly improved intrinsic failure rates, but dense packaging precluded on-line CPU repair. In addition, characteristics of the microprocessor technology posed difficulties for traditional in-line error checking. As a result, system fault-tolerant design, particularly in CPUs and memory, underwent another evolution from G1 to G5. G5 implements an innovative design for a high-performance, fault-tolerant single-chip microprocessor. Dynamic CPU sparing delivers a transparent concurrent repair mechanism. A new internal channel provides a high-performance, highly available Parallel Sysplex® in a single mainframe. G5 is both the culmination of decades of innovation and careful implementation, and the highest achievement of S/390 fault-tolerant design.