Architecting a reliable CMP switch architecture

Authors:
Kypros Constantinides;Stephen Plaza;Jason Blome;Valeria Bertacco;Scott Mahlke;Todd Austin;Bin Zhang;Michael Orshansky
Affiliations:
University of Michigan, Ann Arbor, MI;University of Michigan, Ann Arbor, MI;University of Michigan, Ann Arbor, MI;University of Michigan, Ann Arbor, MI;University of Michigan, Ann Arbor, MI;University of Michigan, Ann Arbor, MI;University of Texas at Austin, Austin, TX;University of Texas at Austin, Austin, TX
Venue:
ACM Transactions on Architecture and Code Optimization (TACO)
Year:
2007


Abstract

As silicon technologies move into the nanometer regime, transistor reliability is expected to wane as devices become subject to extreme process variation, particle-induced transient errors, and transistor wear-out. Unless these challenges are addressed, computer vendors can expect low yields and short mean times to failure. In this article, we examine the challenges of designing complex computing systems in the presence of transient and permanent faults. We select one small aspect of a typical chip multiprocessor (CMP) system to study in detail: a single CMP router switch. Our goal is to design a BulletProof CMP switch architecture capable of tolerating significant levels of various types of defects. We first assess the vulnerability of the CMP switch to transient faults. To better understand the impact of these faults, we evaluate our CMP switch designs using circuit-level timing on detailed physical layouts. Our infrastructure represents a new level of fidelity in architectural-level fault analysis, as we can accurately track faults as they occur, noting whether each one manifests or is masked in the circuits, logic, or architecture. We find that transient faults, because of their fleeting nature, are of little concern for our CMP switch, even within large switch fabrics with fast clocks. Next, we develop a unified model of permanent faults, based on the time-tested bathtub curve. Using this convenient abstraction, we analyze the reliability versus area tradeoff across a wide spectrum of CMP switch designs, ranging from unprotected designs to fully protected designs with on-line repair and recovery capabilities. Protection is considered at multiple levels, from the entire system down through arbitrary partitions of the design. We find that domain-specific techniques, such as end-to-end error detection, resource sparing, automatic circuit decomposition, and iterative diagnosis and reconfiguration, yield designs that tolerate a larger number of defects with less overhead than naïve triple-modular redundancy.
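
For readers unfamiliar with the two reference points the abstract names, the following Python sketch illustrates a generic bathtub-shaped failure rate and the textbook reliability of triple-modular redundancy (TMR). It is not the article's fault model: the three-term hazard (a decreasing Weibull term, a constant term, and an increasing Weibull term), every parameter value, and the function names are assumptions chosen only for illustration, and the TMR formula assumes a perfect voter and independent module failures.

```python
def weibull_hazard(t, shape, scale):
    """Weibull failure rate at time t > 0 (shape < 1 decreases, shape > 1 increases)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t, infant=(0.5, 1000.0), const_rate=1e-5, wearout=(3.0, 50000.0)):
    """Illustrative bathtub curve: infant mortality + constant rate + wear-out.

    All parameters are hypothetical (shape, scale) pairs and a rate per hour;
    they are not taken from the article. Requires t > 0.
    """
    return weibull_hazard(t, *infant) + const_rate + weibull_hazard(t, *wearout)


def tmr_reliability(r):
    """Probability that at least 2 of 3 identical modules work.

    Assumes a perfect voter and independent failures, each module
    working with probability r: 3r^2(1 - r) + r^3 = 3r^2 - 2r^3.
    """
    return 3 * r**2 - 2 * r**3


if __name__ == "__main__":
    # Failure rate is high early, low in mid-life, and rises again at wear-out.
    for hours in (10, 5000, 200000):
        print(hours, bathtub_hazard(hours))
    # TMR helps only when each module is more likely than not to work.
    for r in (0.9, 0.5, 0.4):
        print(r, tmr_reliability(r))
```

Under these assumptions TMR improves on a single module only when the per-module reliability exceeds 0.5, while always costing roughly three times the area plus a voter, which is the kind of overhead the domain-specific techniques above aim to undercut.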