A survey of checker architectures

Authors:
Rajshekar Kalayappan; Smruti R. Sarangi
Affiliations:
Indian Institute of Technology, New Delhi, India; Indian Institute of Technology, New Delhi, India
Venue:
ACM Computing Surveys (CSUR)
Year:
2013


Abstract

Reliability is quickly becoming a primary design constraint for high-end processors because of the inherent limits of manufacturability, the extreme miniaturization of transistors, and the growing complexity of large multicore chips. Achieving a high degree of fault tolerance requires detecting faults quickly and then rectifying them. This article focuses on the former: detection. We survey fault detection mechanisms for processors at the circuit, architecture, and software levels, and collectively refer to such mechanisms as checker architectures. First, we propose a novel two-level taxonomy that classifies checkers by their structure and functionality. Then, for each class, we present the ideas in some of the seminal papers that have defined the direction of the area, along with important extensions published in later work.