Transient-fault recovery for chip multiprocessors

Authors:
Mohamed Gomaa;Chad Scarbrough;T. N. Vijaykumar;Irith Pomeranz
Affiliations:
Purdue University;Purdue University;Purdue University;Purdue University
Venue:
Proceedings of the 30th annual international symposium on Computer architecture
Year:
2003

Citing 11
Cited 72

Simultaneous multithreading: maximizing on-chip parallelism

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
DIVA: a reliable substrate for deep submicron microarchitecture design

Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Transient fault detection via simultaneous multithreading

Proceedings of the 27th annual international symposium on Computer architecture
Slipstream processors: improving both performance and fault tolerance

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Transient-fault recovery using simultaneous multithreading

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Detailed design and evaluation of redundant multithreading alternatives

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Dual use of superscalar datapath for transient-fault detection and recovery

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
IBM's S/390 G5 Microprocessor Design

IEEE Micro
Concurrent Error Detection Using Watchdog Processors-A Survey

IEEE Transactions on Computers
AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors

FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing

A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor

Proceedings of the 31st annual international symposium on Computer architecture
A Complexity-Effective Approach to ALU Bandwidth Enhancement for Instruction-Level Temporal Redundancy

Proceedings of the 31st annual international symposium on Computer architecture
Fingerprinting: bounding soft-error detection latency and bandwidth

ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
A Case for Clumsy Packet Processors

Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
SWIFT: Software Implemented Fault Tolerance

Proceedings of the international symposium on Code generation and optimization
Improving java virtual machine reliability for memory-constrained embedded systems

Proceedings of the 42nd annual Design Automation Conference
Design and Evaluation of Hybrid Fault-Detection Systems

Proceedings of the 32nd annual international symposium on Computer Architecture
Opportunistic Transient-Fault Detection

Proceedings of the 32nd annual international symposium on Computer Architecture
Optimizing inter-processor data locality on embedded chip multiprocessors

Proceedings of the 5th ACM international conference on Embedded software
Exploiting Coarse-Grain Verification Parallelism for Power-Efficient Fault Tolerance

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Fault Tolerance Techniques for the Merrimac Streaming Supercomputer

SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Power-Efficient Error Tolerance in Chip Multiprocessors

IEEE Micro
Compiler-directed channel allocation for saving power in on-chip networks

Conference record of the 33rd ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Software-controlled fault tolerance

ACM Transactions on Architecture and Code Optimization (TACO)
Using loop invariants to fight soft errors in data caches

Proceedings of the 2005 Asia and South Pacific Design Automation Conference
Code restructuring for improving cache performance of MPSoCs

ICCAD '05 Proceedings of the 2005 IEEE/ACM International conference on Computer-aided design
Dynamic partitioning of processing and memory resources in embedded MPSoC architectures

Proceedings of the conference on Design, automation and test in Europe: Proceedings
An Integrated Framework for Dependable and Revivable Architectures Using Multicore Processors

Proceedings of the 33rd annual international symposium on Computer Architecture
Software based fault tolerance: a survey

Ubiquity
Self-checking instructions: reducing instruction redundancy for concurrent error detection

Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Hardware support for spin management in overcommitted virtual machines

Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Static typing for a faulty lambda calculus

Proceedings of the eleventh ACM SIGPLAN international conference on Functional programming
Understanding prediction-based partial redundant threading for low-overhead, high-coverage fault tolerance

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Reunion: Complexity-Effective Multicore Redundancy

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Architecting a reliable CMP switch architecture

ACM Transactions on Architecture and Code Optimization (TACO)
Configurable isolation: building high availability systems with commodity multi-core processors

Proceedings of the 34th annual international symposium on Computer architecture
Dynamic prediction of architectural vulnerability from microarchitectural state

Proceedings of the 34th annual international symposium on Computer architecture
Fault-tolerant typed assembly language

Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Compiler-Managed Software-based Redundant Multi-Threading for Transient Fault Detection

Proceedings of the International Symposium on Code Generation and Optimization
Modeling and improving data cache reliability: 1

Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Transient fault prediction based on anomalies in processor events

Proceedings of the conference on Design, automation and test in Europe
Isolation in Commodity Multicore Processors

Computer
Exploiting access semantics and program behavior to reduce snoop power in chip multiprocessors

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Understanding the propagation of hard errors to software and implications for resilient system design

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Dependability, power, and performance trade-off on a multicore processor

Proceedings of the 2008 Asia and South Pacific Design Automation Conference
Extending stability beyond CPU millennium: a micron-scale atomistic simulation of Kelvin-Helmholtz instability

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Efficient fault tolerance in multi-media applications through selective instruction replication

Proceedings of the 2008 workshop on Radiation effects and fault tolerance in nanometer technologies
A light-weight cache-based fault detection and checkpointing scheme for MPSoCs enabling relaxed execution synchronization

CASES '08 Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems
Skewed redundancy

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
An Approach for Enhancing Inter-processor Data Locality on Chip Multiprocessors

Transactions on High-Performance Embedded Architectures and Compilers I
Mixed-mode multicore reliability

Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Implementing high availability memory with a duplication cache

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
ESoftCheck: Removal of Non-vital Checks for Fault Tolerance

Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization
End-to-end register data-flow continuous self-test

Proceedings of the 36th annual international symposium on Computer architecture
AN-Encoding Compiler: Building Safety-Critical Systems with Commodity Hardware

SAFECOMP '09 Proceedings of the 28th International Conference on Computer Safety, Reliability, and Security
REPAS: Reliable Execution for Parallel ApplicationS in Tiled-CMPs

Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Architecture Design for Soft Errors

Architecture Design for Soft Errors
Synchronizing redundant cores in a dynamic DMR multicore architecture

IEEE Transactions on Circuits and Systems II: Express Briefs
Selective replication: A lightweight technique for soft errors

ACM Transactions on Computer Systems (TOCS)
Shoestring: probabilistic soft error reliability on the cheap

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Improving chip multiprocessor reliability through code replication

Computers and Electrical Engineering
Energy-efficient redundant execution for chip multiprocessors

Proceedings of the 20th symposium on Great lakes symposium on VLSI
Modeling soft errors for data caches and alleviating their effects on data reliability

Microprocessors & Microsystems
DAFT: decoupled acyclic fault tolerance

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Multiplexed redundant execution: a technique for efficient fault tolerance in chip multiprocessors

Proceedings of the Conference on Design, Automation and Test in Europe
Performance-asymmetry-aware scheduling for Chip Multiprocessors with static core coupling

Journal of Systems Architecture: the EUROMICRO Journal
On the design and analysis of fault tolerant NoC architecture using spare routers

Proceedings of the 16th Asia and South Pacific Design Automation Conference
On the exploitation of narrow-width values for improving register file reliability

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Resilience of mutual exclusion algorithms to transient memory faults

Proceedings of the 30th annual ACM SIGACT-SIGOPS symposium on Principles of distributed computing
Sampling + DMR: practical and low-overhead permanent fault detection

Proceedings of the 38th annual international symposium on Computer architecture
Releasing efficient beta cores to market early

Proceedings of the 38th annual international symposium on Computer architecture
A self-checking hardware journal for a fault-tolerant processor architecture

International Journal of Reconfigurable Computing - Special issue on selected papers from the international workshop on reconfigurable communication-centric systems on chips (ReCoSoC' 2010)
Trade-offs in transient fault recovery schemes for redundant multithreaded processors

HiPC'06 Proceedings of the 13th international conference on High Performance Computing
Resource-Driven optimizations for transient-fault detecting superscalar microarchitectures

ACSAC'05 Proceedings of the 10th Asia-Pacific conference on Advances in Computer Systems Architecture
Efficient soft error protection for commodity embedded microprocessors using profile information

Proceedings of the 13th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory for Embedded Systems
UniFI: leveraging non-volatile memories for a unified fault tolerance and idle power management technique

Proceedings of the 26th ACM international conference on Supercomputing
Dynamic transient fault detection and recovery for embedded processor datapaths

Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Detection and correction of silent data corruption for large-scale high-performance computing

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
FaulTM: error detection and recovery using hardware transactional memory

Proceedings of the Conference on Design, Automation and Test in Europe
A survey of checker architectures

ACM Computing Surveys (CSUR)
Virtually-aged sampling DMR: unifying circuit failure prediction and circuit failure detection

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture

Abstract

To address the increasing susceptibility of commodity chip multiprocessors (CMPs) to transient faults, we propose Chip-level Redundantly Threaded multiprocessor with Recovery (CRTR). CRTR extends the previously proposed CRT for transient-fault detection in CMPs, and the previously proposed SRTR for transient-fault recovery in SMT. All these schemes achieve fault tolerance by executing and comparing two copies, called leading and trailing threads, of a given application. Previous recovery schemes for SMT do not perform well on CMPs. In a CMP, the leading and trailing threads execute on different processors to achieve load balancing and reduce the probability of a fault corrupting both threads, whereas in an SMT, both threads execute on the same processor. The inter-processor communication required to compare the threads introduces latency and bandwidth problems not present in an SMT. To hide inter-processor latency, CRTR executes the leading thread ahead of the trailing thread by maintaining a long slack, enabled by asymmetric commit. CRTR commits the leading thread before checking and the trailing thread after checking, so that the trailing thread state may be used for recovery. Previous recovery schemes commit both threads after checking, making a long slack suboptimal. To tackle inter-processor bandwidth, CRTR not only increases the bandwidth supply by pipelining the communication paths, but also reduces the bandwidth demand. By reasoning that faults propagate through dependences, previously proposed Dependence-Based Checking Elision (DBCE) exploits (true) register dependence chains so that only the value of the last instruction in a chain is checked. However, instructions that mask operand bits may mask faults and limit the use of dependence chains. We propose Death- and Dependence-Based Checking Elision (DDBCE), which chains a masking instruction only if the source operand of the instruction dies after the instruction. Register deaths ensure that masked faults do not corrupt later computation. Using SPEC2000, we show that CRTR incurs negligible performance loss compared to CRT for inter-processor (one-way) latency as high as 30 cycles, and that the bandwidth requirements of CRT and CRTR with DDBCE are 5.2 and 7.1 bytes/cycle, respectively.
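The chaining rule described in the abstract can be illustrated with a small software model. The sketch below is a toy, not the paper's hardware mechanism: the `Instr` record, the `dies_after` and `checked_instructions` helpers, the five-instruction trace, and the rule that a producer's check is elided whenever any consumer is chainable are all assumptions made for illustration. It contrasts a DBCE-like rule (masking instructions are never chained) with a DDBCE-like rule (a masking instruction is chained only if the source register it reads is dead afterwards).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record for this toy model."""
    name: str
    dest: Optional[str]        # register written, or None
    srcs: Tuple[str, ...]      # registers read
    masks: bool = False        # True if the op can hide faulty operand bits


def dies_after(trace: List[Instr], j: int, reg: str) -> bool:
    """True if reg is overwritten before it is read again after trace[j]."""
    for later in trace[j + 1:]:
        if reg in later.srcs:
            return False       # read again: still live
        if later.dest == reg:
            return True        # overwritten first: dead
    return False               # end of trace: conservatively assume live


def checked_instructions(trace: List[Instr], use_death_rule: bool) -> List[str]:
    """Names of instructions whose values would still be compared."""
    checked = []
    for i, prod in enumerate(trace):
        elide = False
        if prod.dest is not None:
            for j in range(i + 1, len(trace)):
                cons = trace[j]
                if prod.dest in cons.srcs:
                    chainable = (not cons.masks) or (
                        use_death_rule and dies_after(trace, j, prod.dest)
                    )
                    if chainable:
                        elide = True   # a later check covers this value
                        break
                if cons.dest == prod.dest:
                    break              # value overwritten: chain ends here
        if not elide:
            checked.append(prod.name)
    return checked


# Illustrative trace (assumed, not taken from the paper).
TRACE = [
    Instr("i0", "r1", ()),                    # r1 = load
    Instr("i1", "r2", ("r1",), masks=True),   # r2 = r1 & 0xff
    Instr("i2", "r1", ("r2",)),               # r1 = r2 + 1 (r1 from i0 is dead)
    Instr("i3", "r3", ("r1",), masks=True),   # r3 = r1 & 1
    Instr("i4", "r4", ("r1",), masks=True),   # r4 = r1 >> 8
]

dbce_like = checked_instructions(TRACE, use_death_rule=False)
ddbce_like = checked_instructions(TRACE, use_death_rule=True)
print(dbce_like)   # ['i0', 'i2', 'i3', 'i4']
print(ddbce_like)  # ['i2', 'i3', 'i4']
```

In this trace, `i0` is checked under the DBCE-like rule because its only consumer masks bits, but its check is elided under the DDBCE-like rule because `r1` is overwritten by `i2` before being read again, so a fault masked by `i1` cannot reach later computation. `i2` stays checked under both rules because `r1` is still read after `i3` and is not known to be dead after `i4`.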