Towards scalable reliability frameworks for error prone CMPs

Authors:
Joseph Sloan;Rakesh Kumar
Affiliations:
Coordinated Science Laboratory, University of Illinois, Urbana, IL, USA;Coordinated Science Laboratory, University of Illinois, Urbana, IL, USA
Venue:
CASES '09 Proceedings of the 2009 international conference on Compilers, architecture, and synthesis for embedded systems
Year:
2009

References (13):
The NYU ultracomputer—designing a MIMD, shared-memory parallel machine. 25 years of the international symposia on Computer architecture (selected papers)
Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design
Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic. DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
Trends and Challenges in VLSI Circuit Reliability. IEEE Micro
A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor. Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Characterizing the Effects of Transient Faults on a High-Performance Processor Pipeline. DSN '04 Proceedings of the 2004 International Conference on Dependable Systems and Networks
Modeling and analysis of circuit performance of ballistic CNFET. Proceedings of the 43rd annual Design Automation Conference
Reunion: Complexity-Effective Multicore Redundancy. Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance. DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
Utilizing Dynamically Coupled Cores to Form a Resilient Chip Multiprocessor. DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
Probabilistic system-on-a-chip architectures. ACM Transactions on Design Automation of Electronic Systems (TODAES)
Soft-error resilience of the IBM POWER6 processor. IBM Journal of Research and Development
DDMR: Dynamic and Scalable Dual Modular Redundancy with Short Validation Intervals. IEEE Computer Architecture Letters

Cited by (1):
Cost-effective safety and fault localization using distributed temporal redundancy. CASES '11 Proceedings of the 14th international conference on Compilers, architectures and synthesis for embedded systems

Abstract

As technology scales and the energy of computation continually approaches thermal equilibrium [1,2], parameter variations and noise levels will lead to larger error rates at various levels of the computation stack. Error rates are expected to be especially high for post-CMOS and nanoelectronic systems, as well as for probabilistic [3] and stochastic [4] architectures. N-modular redundancy (NMR) at the core level has been proposed as a way to attain system reliability goals for multicore architectures. While core-level DMR and TMR have been shown to be effective when errors are rare, a large amount of core-level redundancy will be required to attain system reliability goals in the face of high error rates. This makes voting latency and bandwidth significant performance bottlenecks for such systems. In this paper, we present a scalable NMR framework for error-prone chip multiprocessors (CMPs). The framework supports in-network fault tolerance, in which voting logic is integrated into the routers to allow truly distributed voting. The in-network fault tolerance router exploits the expected redundancy in vote messages to reduce some of the blocking overhead incurred at the leader, and it also provides a mechanism to trade off network bandwidth against latency. Our framework also supports proactive checkpoint deallocation, which allows cores participating in a vote to continue execution instead of waiting for notification from the voting logic. Finally, the framework supports dynamic constitution, which allows an arbitrary core on the chip to be part of an NMR group. This makes it possible to bypass faulty cores as well as to schedule for performance. Our experiments show significant performance and bandwidth benefits from these optimizations.
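
To make the abstract's argument about redundancy and voting concrete, the following is a minimal Python sketch and not the paper's implementation. It assumes that replica errors are independent with a per-replica error probability `p_err`, that a vote succeeds when a strict majority of replicas is correct, and that a router could combine partial vote tallies before forwarding them. All function names (`majority_vote`, `nmr_success_probability`, `merge_tallies`, `decide`) are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import comb


def majority_vote(results):
    """Return the value reported by a strict majority of replicas, else None."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) // 2 else None


def nmr_success_probability(n, p_err):
    """Probability that a strict majority of n replicas is correct.

    Assumes independent replica errors with probability p_err each.
    """
    need = n // 2 + 1
    return sum(
        comb(n, k) * (1 - p_err) ** k * p_err ** (n - k)
        for k in range(need, n + 1)
    )


def merge_tallies(a, b):
    """Combine two partial tallies (value -> count).

    Illustrates how identical votes could be collapsed inside the network
    so that fewer messages reach the leader (an assumed mechanism).
    """
    return Counter(a) + Counter(b)


def decide(tally, group_size):
    """Pick the winner from a merged tally if it holds a strict majority."""
    if not tally:
        return None
    value, count = tally.most_common(1)[0]
    return value if count > group_size // 2 else None


if __name__ == "__main__":
    # With rare errors TMR is already strong; with frequent errors it is not.
    for p in (0.01, 0.3):
        for n in (1, 3, 5, 9):
            print(f"p_err={p} N={n} P(majority correct)={nmr_success_probability(n, p):.4f}")
```

Under these assumptions, raising the per-replica error probability from 0.01 to 0.3 lowers the chance that a TMR group has a correct majority, so a larger group is needed to reach the same target. Each added replica also adds vote traffic, which is the bottleneck the in-network tally merging in this sketch is meant to illustrate.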