The POWER4-based p690 systems offer the highest performance of the IBM eServer pSeries™ line of computers. Within the general-purpose UNIX® server market, they also offer the highest levels of concurrent error detection, fault isolation, recovery, and availability. High availability is achieved by minimizing component failure rates through improvements in the base technology, and through design techniques that permit hard- and soft-failure detection, recovery, and isolation, repair deferral, and component replacement concurrent with system operation. In this paper, we discuss the fault-tolerant design techniques that were used for array, logic, storage, and I/O subsystems for the p690. We also present the diagnostic strategy, fault-isolation, and recovery techniques. New features such as POWER4 synchronous machine-check interrupt, PCI bus error recovery, array dynamic redundancy, and minimum-element dynamic reconfiguration are described. The design process used to verify error detection, fault isolation, and recovery is also described.