StageNetSlice: a reconfigurable microarchitecture building block for resilient CMP systems

Authors:
Shantanu Gupta;Shuguang Feng;Amin Ansari;Jason Blome;Scott Mahlke
Affiliations:
University of Michigan, Ann Arbor, MI, USA;University of Michigan, Ann Arbor, MI, USA;University of Michigan, Ann Arbor, MI, USA;University of Michigan, Ann Arbor, MI, USA;University of Michigan, Ann Arbor, MI, USA
Venue:
CASES '08 Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems
Year:
2008

Abstract

Although CMOS feature size scaling has been the source of dramatic performance gains, it has led to mounting reliability concerns due to increasing power densities and on-chip temperatures. Given that most wearout mechanisms that plague semiconductor devices are highly dependent on these parameters, significantly higher failure rates are projected for future technology generations. Traditional techniques for dealing with device failures have relied on coarse-grained redundancy to maintain service in the face of failed components. In this work, we challenge this practice by identifying its inability to scale to high-failure-rate scenarios and investigate the advantages of finer-grained configurations. We use this study to motivate the design of StageNet, an embedded CMP architecture designed from its inception with reliability as a first-class design constraint. StageNet relies on a reconfigurable network of replicated processor pipeline stages to maximize the useful lifetime of the chip, gracefully degrading performance toward the end of life. This paper addresses the microarchitecture of the basic building block of StageNet, named StageNetSlice, which is a processor core composed of networked pipeline stages. A naive slice design results in approximately 4X slowdown versus a traditional processor due to longer communication delays in the pipeline. However, several small design changes that eliminate inter-stage communication paths and minimize communication bandwidth reduce this overhead to 11% on average while providing high levels of fine-grain adaptability.
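
The contrast between coarse-grained and finer-grained redundancy described in the abstract can be illustrated with a small probabilistic sketch. This is not the paper's evaluation methodology; it is a toy model under stated assumptions: every pipeline stage fails independently with the same probability, stages of the same type are freely interchangeable across cores in the fine-grained case, and the interconnect itself never fails. The function names and parameters are hypothetical.

```python
from math import comb


def expected_cores_coarse(n_cores: int, n_stages: int, p_fail: float) -> float:
    """Expected working cores when a core is lost as soon as any one
    of its stages fails (core-level redundancy)."""
    return n_cores * (1.0 - p_fail) ** n_stages


def expected_cores_fine(n_cores: int, n_stages: int, p_fail: float) -> float:
    """Expected working logical pipelines when any surviving stage of a
    given type can be combined with surviving stages of the other types
    (stage-level redundancy, an idealized assumption)."""
    q = 1.0 - p_fail

    def at_least(k: int) -> float:
        # P(at least k of the n_cores copies of one stage type survive)
        return sum(
            comb(n_cores, j) * q**j * p_fail ** (n_cores - j)
            for j in range(k, n_cores + 1)
        )

    # Usable pipelines = min over stage types of surviving copies.
    # E[min] = sum over k >= 1 of P(min >= k) = sum of at_least(k) ** n_stages.
    return sum(at_least(k) ** n_stages for k in range(1, n_cores + 1))


if __name__ == "__main__":
    # Hypothetical chip: 8 cores, 5 stages each; sweep the stage failure rate.
    for p in (0.01, 0.05, 0.10, 0.20):
        coarse = expected_cores_coarse(8, 5, p)
        fine = expected_cores_fine(8, 5, p)
        print(f"p_fail={p:.2f}  coarse={coarse:.2f}  fine={fine:.2f}")
```

In this model the fine-grained estimate is never lower than the coarse-grained one, because every fully working core also contributes one working stage of each type. The sketch ignores the performance cost of routing between stages, which the abstract reports as an 11% average overhead after optimization.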