AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors

Authors:
Eric Rotenberg
Affiliations:
-
Venue:
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
Year:
1999


Abstract

This paper speculates that technology trends pose new challenges for fault tolerance in microprocessors. Specifically, severely reduced design tolerances implied by gigahertz clock rates may result in frequent and arbitrary transient faults. We suggest that existing fault-tolerant techniques (system-level, gate-level, or component-specific approaches) are either too costly for general-purpose computing, overly intrusive to the design, or insufficient for covering arbitrary logic faults. An approach in which the microarchitecture itself provides fault tolerance is required.

We propose a new time-redundancy fault-tolerant approach in which a program is duplicated and the two redundant programs run simultaneously on the processor. The technique exploits several significant microarchitectural trends to provide broad coverage of transient faults and restricted coverage of permanent faults. These trends are simultaneous multithreading, control flow and data flow prediction, and hierarchical processors, all of which are intended for higher performance but can be easily leveraged for the specified fault tolerance goals. The overhead for achieving fault tolerance is low, both in terms of performance and changes to the existing microarchitecture. Detailed simulations of five of the SPEC95 benchmarks show that executing two redundant programs on the fault-tolerant microarchitecture takes only 10% to 30% longer than running a single version of the program.
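
The time-redundancy idea in the abstract can be illustrated with a small software analogy. The sketch below is not the paper's hardware mechanism: it runs the two copies one after the other rather than simultaneously on a multithreaded processor, and the names `run_stream`, `redundant_execute`, `PROGRAM`, and the `fault_at` parameter are all hypothetical, introduced only for illustration. It shows the core principle under those assumptions: execute the same program twice from the same initial state, compare the results instruction by instruction, and treat the first mismatch as a detected transient fault.

```python
from copy import deepcopy
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, int]
# Hypothetical "instruction": reads the state, returns (destination, value).
Instr = Callable[[State], Tuple[str, int]]


def run_stream(program: List[Instr], state: State,
               fault_at: Optional[int] = None) -> List[Tuple[str, int]]:
    """Execute one copy of the program and record every committed result.

    fault_at (an assumption for this demo) flips the low bit of the result
    of one instruction to mimic a transient fault.
    """
    results = []
    for i, instr in enumerate(program):
        reg, value = instr(state)
        if i == fault_at:
            value ^= 1  # injected single-bit transient fault
        state[reg] = value
        results.append((reg, value))
    return results


def redundant_execute(program: List[Instr], initial: State,
                      fault_at: Optional[int] = None) -> Optional[int]:
    """Run two redundant copies and return the index of the first
    mismatching instruction, or None if the two copies agree."""
    first = run_stream(program, deepcopy(initial), fault_at)
    second = run_stream(program, deepcopy(initial))
    for i, (a, b) in enumerate(zip(first, second)):
        if a != b:
            return i  # fault detected at instruction i
    return None


# A three-instruction toy program over named registers.
PROGRAM: List[Instr] = [
    lambda s: ("r1", s["r0"] + 5),
    lambda s: ("r2", s["r1"] * 2),
    lambda s: ("r3", s["r2"] - s["r0"]),
]

if __name__ == "__main__":
    print(redundant_execute(PROGRAM, {"r0": 3}))              # fault-free run
    print(redundant_execute(PROGRAM, {"r0": 3}, fault_at=1))  # faulty run
```

Because the fault is injected into only one of the two copies, the comparison exposes it at the instruction where the results first diverge. A fault that corrupted both copies identically would go unnoticed by this kind of check.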