Transient fault detection via simultaneous multithreading

Authors:
Steven K. Reinhardt; Shubhendu S. Mukherjee
Affiliations:
EECS Department, University of Michigan, Ann Arbor, 1301 Beal Avenue, Ann Arbor, MI; VSSAD, Alpha Technology Group, Compaq Computer Corporation, 334 South Street, Mail Stop SHR3-2E/R28, Shrewsbury, MA
Venue:
Proceedings of the 27th annual international symposium on Computer architecture
Year:
2000

Citing 13
Cited 135


Abstract

Smaller feature sizes, reduced voltage levels, higher transistor counts, and reduced noise margins make future generations of microprocessors increasingly prone to transient hardware faults. Most commercial fault-tolerant computers use fully replicated hardware components to detect microprocessor faults. The components are lockstepped (cycle-by-cycle synchronized) to ensure that, in each cycle, they perform the same operation on the same inputs, producing the same outputs in the absence of faults. Unfortunately, for a given hardware budget, full replication reduces performance by statically partitioning resources among redundant operations.

We demonstrate that a Simultaneous and Redundantly Threaded (SRT) processor, derived from a Simultaneous Multithreaded (SMT) processor, provides transient fault coverage with significantly higher performance. An SRT processor provides transient fault coverage by running identical copies of the same program simultaneously as independent threads. An SRT processor provides higher performance because it dynamically schedules its hardware resources among the redundant copies. However, dynamic scheduling makes it difficult to implement lockstepping, because corresponding instructions from redundant threads may not execute in the same cycle or in the same order.

This paper makes four contributions to the design of SRT processors. First, we introduce the concept of the sphere of replication, which abstracts both the physical redundancy of a lockstepped system and the logical redundancy of an SRT processor. This framework aids in identifying the scope of fault coverage and the input and output values requiring special handling. Second, we identify two viable spheres of replication in an SRT processor, and show that one of them provides fault detection while checking only committed stores and uncached loads. Third, we identify the need for consistent replication of load values, and propose and evaluate two new mechanisms for satisfying this requirement. Finally, we propose and evaluate two mechanisms, slack fetch and branch outcome queue, that enhance the performance of an SRT processor by allowing one thread to prefetch cache misses and branch results for the other thread.

Our results with 11 SPEC95 benchmarks show that an SRT processor can outperform an equivalently sized, on-chip, hardware-replicated solution by 16% on average, with a maximum benefit of up to 29%.
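The sphere-of-replication idea can be illustrated with a toy software model. The sketch below is not the paper's hardware design. It assumes a tiny accumulator instruction set and runs the two redundant threads one after the other, whereas an SRT processor runs them simultaneously. All names (`run_thread`, `run_redundant`, `load_queue`, `store_queue`, `flip_at`) are hypothetical and chosen for illustration. It shows two points from the abstract: load values entering the sphere are replicated consistently by forwarding them from one thread to the other, and only stores leaving the sphere are compared.

```python
from collections import deque


class FaultDetected(Exception):
    """Raised when the redundant threads disagree at the sphere boundary."""


def run_thread(program, load_value, emit_store, flip_at=None):
    """Execute a tiny accumulator program for one redundant thread.

    program: list of ("load", addr), ("add", constant) or ("store", addr).
    flip_at: hypothetical fault-injection hook; after the instruction at
             this index, bit 0 of the accumulator is flipped.
    """
    acc = 0
    for index, (op, arg) in enumerate(program):
        if op == "load":
            acc = load_value(arg)
        elif op == "add":
            acc += arg
        elif op == "store":
            emit_store(arg, acc)
        else:
            raise ValueError(f"unknown operation: {op}")
        if index == flip_at:
            acc ^= 1  # model a transient single-bit upset


def run_redundant(program, memory, fault_in_trailing_at=None):
    """Run two copies of `program`; commit a store only if both copies agree."""
    load_queue = deque()   # load values forwarded from leading to trailing thread
    store_queue = deque()  # leading thread's stores awaiting comparison

    def leading_load(addr):
        # Forward from the youngest unverified store to this address, if any,
        # otherwise read memory (an input crossing into the sphere).
        for pending_addr, pending_value in reversed(store_queue):
            if pending_addr == addr:
                value = pending_value
                break
        else:
            value = memory[addr]
        load_queue.append((addr, value))
        return value

    def leading_store(addr, value):
        store_queue.append((addr, value))

    def trailing_load(addr):
        # The trailing thread never reads memory itself, so both copies
        # always observe the same load value.
        forwarded_addr, value = load_queue.popleft()
        if forwarded_addr != addr:
            raise FaultDetected(f"load address mismatch: {forwarded_addr} vs {addr}")
        return value

    def trailing_store(addr, value):
        expected = store_queue.popleft()
        if expected != (addr, value):
            raise FaultDetected(f"store mismatch: {expected} vs {(addr, value)}")
        memory[addr] = value  # only a verified store leaves the sphere

    run_thread(program, leading_load, leading_store)
    run_thread(program, trailing_load, trailing_store, flip_at=fault_in_trailing_at)
    return memory


if __name__ == "__main__":
    demo = [("load", 0), ("add", 5), ("store", 1)]
    print(run_redundant(demo, {0: 10, 1: 0}))
    try:
        run_redundant(demo, {0: 10, 1: 0}, fault_in_trailing_at=1)
    except FaultDetected as err:
        print("detected:", err)
```

In this model a corrupted value is caught only when it reaches a store, which mirrors the abstract's point that checking outputs at the sphere boundary can be sufficient for detection. The model does not attempt to represent slack fetch or the branch outcome queue, which are performance mechanisms rather than detection mechanisms.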