Multiplexed redundant execution: a technique for efficient fault tolerance in chip multiprocessors

Authors:
Pramod Subramanyan;Virendra Singh;Kewal K. Saluja;Erik Larsson
Affiliations:
Indian Institute of Science, Bangalore, India;Indian Institute of Science, Bangalore, India;University of Wisconsin-Madison, Madison, WI;Linköping University, Linköping, Sweden
Venue:
Proceedings of the Conference on Design, Automation and Test in Europe
Year:
2010

Abstract

Continued CMOS scaling is expected to make future microprocessors susceptible to transient faults, hard faults, manufacturing defects and process variations, making fault tolerance important even for general-purpose processors targeted at the commodity market. To mitigate the effect of decreased reliability, a number of fault-tolerant architectures have been proposed that exploit the natural coarse-grained redundancy available in chip multiprocessors (CMPs). These architectures execute a single application as two threads, typically one leading thread and one trailing thread, scheduled on two cores or two thread contexts of a CMP. Errors are detected by comparing the outputs produced by the two threads. As a result, besides the additional energy consumption and performance overhead required to provide fault tolerance, such schemes also impose a throughput loss: a CMP capable of executing 2n threads in non-redundant mode can execute only half as many (n) threads in fault-tolerant mode. In this paper we propose multiplexed redundant execution (MRE), a low-overhead architectural technique that executes multiple trailing threads on a single processor core. MRE exploits the observation that the trailing thread can be accelerated by providing it with execution assistance from the leading thread. Execution assistance combined with coarse-grained multithreading allows MRE to schedule multiple trailing threads concurrently on a single core with only a small performance penalty. Our results show that MRE increases the throughput of a fault-tolerant CMP by 16% over an ideal dual modular redundant (DMR) architecture.
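
The slot-counting argument in the abstract can be made concrete with a small sketch. It is an illustrative model only, not the paper's evaluation method: it counts how many applications fit on a CMP when each leading thread occupies its own core and several trailing threads share one core. The function names and the `trailing_per_core` parameter are hypothetical, and the model ignores the performance penalty of multiplexing, so it does not reproduce the reported 16% figure.

```python
def dmr_capacity(cores: int) -> int:
    """Applications that fit when each one occupies two cores
    (one leading thread, one trailing thread)."""
    return cores // 2


def multiplexed_capacity(cores: int, trailing_per_core: int) -> int:
    """Applications that fit when each leading thread has its own core and
    up to `trailing_per_core` trailing threads share a single core.

    `trailing_per_core` is a hypothetical parameter, not a value from the paper.
    """
    if trailing_per_core < 1:
        raise ValueError("trailing_per_core must be at least 1")
    apps = 0
    while True:
        candidate = apps + 1
        # Ceiling division: cores needed to host `candidate` trailing threads.
        trailing_cores = -(-candidate // trailing_per_core)
        if candidate + trailing_cores > cores:
            return apps
        apps = candidate


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, multiplexed_capacity(8, k))
```

With `trailing_per_core` set to 1 the model reduces to the conventional case, where a CMP with 2n cores runs n applications. Larger values free cores for additional leading threads, which is the source of the throughput gain the abstract describes.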