Algorithm-Based Fault Tolerance on a Hypercube Multiprocessor
IEEE Transactions on Computers
IBM experiments in soft fails in computer electronics (1978–1994)
IBM Journal of Research and Development - Special issue: terrestrial cosmic rays and soft errors
Reflections on the Pentium Division Bug
IEEE Transactions on Computers
Lamport clocks: verifying a directory cache-coherence protocol
Proceedings of the tenth annual ACM symposium on Parallel algorithms and architectures
Design and Evaluation of System-Level Checks for On-Line Control Flow Error Detection
IEEE Transactions on Parallel and Distributed Systems
DIVA: a reliable substrate for deep submicron microarchitecture design
Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Transient fault detection via simultaneous multithreading
Proceedings of the 27th annual international symposium on Computer architecture
Slipstream processors: improving both performance and fault tolerance
ACM SIGPLAN Notices
Efficient checker processor design
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
A study of slipstream processors
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Transient-fault recovery using simultaneous multithreading
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Dual use of superscalar datapath for transient-fault detection and recovery
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Exploiting Instruction-Level Parallelism for Integrated Control-Flow Monitoring
IEEE Transactions on Computers
Soft-Error Detection through Software Fault-Tolerance Techniques
DFT '99 Proceedings of the 14th International Symposium on Defect and Fault-Tolerance in VLSI Systems
REESE: A Method of Soft Error Detection in Microprocessors
DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
Master/slave speculative parallelization
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
SOSP '81 Proceedings of the eighth ACM symposium on Operating systems principles
Evaluation of integrated system-level checks for on-line error detection
IPDS '96 Proceedings of the 2nd International Computer Performance and Dependability Symposium (IPDS '96)
Transient-fault recovery for chip multiprocessors
Proceedings of the 30th annual international symposium on Computer architecture
Soft-Error Detection Using Control Flow Assertions
DFT '03 Proceedings of the 18th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems
Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Verification: what works and what doesn't
Proceedings of the 41st annual Design Automation Conference
TSOtool: A Program for Verifying Memory Systems Using the Memory Consistency Model
Proceedings of the 31st annual international symposium on Computer architecture
Proceedings of the 31st annual international symposium on Computer architecture
Fingerprinting: bounding soft-error detection latency and bandwidth
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Microarchitecture and Design Challenges for Gigascale Integration
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Improving Multiple-CMP Systems Using Token Coherence
HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
SWIFT: Software Implemented Fault Tolerance
Proceedings of the international symposium on Code generation and optimization
Opportunistic Transient-Fault Detection
Proceedings of the 32nd annual international symposium on Computer Architecture
Dynamic Verification of Sequential Consistency
Proceedings of the 32nd annual international symposium on Computer Architecture
NonStop® Advanced Architecture
DSN '05 Proceedings of the 2005 International Conference on Dependable Systems and Networks
Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors
DSN '05 Proceedings of the 2005 International Conference on Dependable Systems and Networks
Exploiting Coarse-Grain Verification Parallelism for Power-Efficient Fault Tolerance
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Software-Based Transparent and Comprehensive Control-Flow Error Detection
Proceedings of the International Symposium on Code Generation and Optimization
Automatic Instruction-Level Software-Only Recovery
DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
CADRE: Cycle-Accurate Deterministic Replay for Hardware Debugging
DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
CEDA: Control-flow Error Detection through Assertions
IOLTS '06 Proceedings of the 12th IEEE International Symposium on On-Line Testing
ReStore: Symptom-Based Soft Error Detection in Microprocessors
IEEE Transactions on Dependable and Secure Computing
Ultra low-cost defect protection for microprocessor pipelines
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Reunion: Complexity-Effective Multicore Redundancy
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Fault-Tolerant Systems
Utilizing Dynamically Coupled Cores to Form a Resilient Chip Multiprocessor
DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
Automated Derivation of Application-aware Error Detectors using Static Analysis
IOLTS '07 Proceedings of the 13th IEEE International On-Line Testing Symposium
Error Detection Using Dynamic Dataflow Verification
PACT '07 Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques
The N-Version Approach to Fault-Tolerant Software
IEEE Transactions on Software Engineering
Error Detection via Online Checking of Cache Coherence with Token Coherence Signatures
HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
A Low Overhead Fault Tolerant Coherence Protocol for CMP Architectures
HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Self-calibrating Online Wearout Detection
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Argus: Low-Cost, Comprehensive Error Detection in Simple Cores
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Techniques to mitigate the effects of congenital faults in processors
Techniques to mitigate the effects of congenital faults in processors
Hierarchical Verification for Increasing Performance in Reliable Processors
Journal of Electronic Testing: Theory and Applications
Dynamic Verification of Memory Consistency in Cache-Coherent Multithreaded Computer Architectures
IEEE Transactions on Dependable and Secure Computing
Facelift: Hiding and slowing down aging in multicores
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Online design bug detection: RTL analysis, flexible mechanisms, and evaluation
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
A performance-correctness explicitly-decoupled architecture
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
End-to-end register data-flow continuous self-test
Proceedings of the 36th annual international symposium on Computer architecture
Extending SRT for parallel applications in tiled-CMP architectures
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
REPAS: Reliable Execution for Parallel ApplicationS in Tiled-CMPs
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
IBM S/390 parallel enterprise server G5 fault tolerance: a historical perspective
IBM Journal of Research and Development
mSWAT: low-cost hardware fault detection and diagnosis for multicore systems
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Specifying and dynamically verifying address translation-aware memory consistency
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Necromancer: enhancing system throughput by animating dead cores
Proceedings of the 37th annual international symposium on Computer architecture
Relax: an architectural framework for software recovery of hardware faults
Proceedings of the 37th annual international symposium on Computer architecture
Sampling + DMR: practical and low-overhead permanent fault detection
Proceedings of the 38th annual international symposium on Computer architecture
A methodology for the generation of efficient error detection mechanisms
DSN '11 Proceedings of the 2011 IEEE/IFIP 41st International Conference on Dependable Systems&Networks
Application-aware diagnosis of runtime hardware faults
Proceedings of the International Conference on Computer-Aided Design
Accelerating microprocessor silicon validation by exposing ISA diversity
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Resource-Driven optimizations for transient-fault detecting superscalar microarchitectures
ACSAC'05 Proceedings of the 10th Asia-Pacific conference on Advances in Computer Systems Architecture
Optimizing Dual-Core Execution for Power Efficiency and Transient-Fault Recovery
IEEE Transactions on Parallel and Distributed Systems
Hi-index | 0.00 |
Reliability is quickly becoming a primary design constraint for high-end processors because of the inherent limits of manufacturability, extreme miniaturization of transistors, and the growing complexity of large multicore chips. To achieve a high degree of fault tolerance, we need to detect faults quickly and try to rectify them. In this article, we focus on the former aspect. We present a survey of different kinds of fault detection mechanisms for processors at circuit, architecture, and software level. We collectively refer to such mechanisms as checker architectures. First, we propose a novel two-level taxonomy for different classes of checkers based on their structure and functionality. Subsequently, for each class we present the ideas in some of the seminal papers that have defined the direction of the area along with important extensions published in later work.