FERRARI: A Flexible Software-Based Fault and Error Injection System
IEEE Transactions on Computers - Special issue on fault-tolerant computing
DIVA: a reliable substrate for deep submicron microarchitecture design
Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
A reconfigurable multi-function computing cache architecture
FPGA '00 Proceedings of the 2000 ACM/SIGDA eighth international symposium on Field programmable gate arrays
Transient fault detection via simultaneous multithreading
Proceedings of the 27th annual international symposium on Computer architecture
Full-system timing-first simulation
SIGMETRICS '02 Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Tracking down software bugs using automatic anomaly detection
Proceedings of the 24th International Conference on Software Engineering
Fault Injection Techniques and Tools
Computer
Fault Injection and Dependability Evaluation of Fault-Tolerant Systems
IEEE Transactions on Computers
AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
Transient-fault recovery for chip multiprocessors
Proceedings of the 30th annual international symposium on Computer architecture
Soft-Error Detection Using Control Flow Assertions
DFT '03 Proceedings of the 18th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
iWatcher: Efficient Architectural Support for Software Debugging
Proceedings of the 31st annual international symposium on Computer architecture
Tolerating Hard Faults in Microprocessor Array Structures
DSN '04 Proceedings of the 2004 International Conference on Dependable Systems and Networks
Characterizing the Effects of Transient Faults on a High-Performance Processor Pipeline
DSN '04 Proceedings of the 2004 International Conference on Dependable Systems and Networks
The Impact of Technology Scaling on Lifetime Reliability
DSN '04 Proceedings of the 2004 International Conference on Dependable Systems and Networks
Error Sensitivity of the Linux Kernel Executing on PowerPC G4 and Pentium 4 Processors
DSN '04 Proceedings of the 2004 International Conference on Dependable Systems and Networks
Microarchitecture and Design Challenges for Gigascale Integration
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
AccMon: Automatically Detecting Memory-Related Bugs via Program Counter-Based Invariants
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
ReVirt: enabling intrusion analysis through virtual-machine logging and replay
OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementationCopyright restrictions prevent ACM from being able to make the PDFs for this conference available for downloading
Scalable statistical bug isolation
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
NonStop® Advanced Architecture
DSN '05 Proceedings of the 2005 International Conference on Dependable Systems and Networks
SoftArch: An Architecture Level Tool for Modeling and Analyzing Soft Errors
DSN '05 Proceedings of the 2005 International Conference on Dependable Systems and Networks
A Mechanism for Online Diagnosis of Hard Faults in Microprocessors
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset
ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Software-controlled fault tolerance
ACM Transactions on Architecture and Code Optimization (TACO)
ReStore: Symptom-Based Soft Error Detection in Microprocessors
IEEE Transactions on Dependable and Secure Computing
AVIO: detecting atomicity violations via access interleaving invariants
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Ultra low-cost defect protection for microprocessor pipelines
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Dynamic Derivation of Application-Specific Error Detectors and their Implementation in Hardware
EDCC '06 Proceedings of the Sixth European Dependable Computing Conference
Flashback: a lightweight extension for rollback and deterministic replay for software debugging
ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
Error Detection Using Dynamic Dataflow Verification
PACT '07 Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques
HARD: Hardware-Assisted Lockset-based Race Detection
HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
Perturbation-based Fault Screening
HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
The Daikon system for dynamic detection of likely invariants
Science of Computer Programming
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Argus: Low-Cost, Comprehensive Error Detection in Simple Cores
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
RAS strategy for IBM S/390 G5 and G6
IBM Journal of Research and Development
An architectural framework for detecting process hangs/crashes
EDCC'05 Proceedings of the 5th European conference on Dependable Computing
Mixed-mode multicore reliability
Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Tolerating hardware device failures in software
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
mSWAT: low-cost hardware fault detection and diagnosis for multicore systems
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Specifying and dynamically verifying address translation-aware memory consistency
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Shoestring: probabilistic soft error reliability on the cheap
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Relax: an architectural framework for software recovery of hardware faults
Proceedings of the 37th annual international symposium on Computer architecture
A realistic evaluation of memory hardware errors and software system susceptibility
USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
IVF: characterizing the vulnerability of microprocessor structures to intermittent faults
Proceedings of the Conference on Design, Automation and Test in Europe
Stealth works: emulating memory errors
RV'10 Proceedings of the First international conference on Runtime verification
Sampling + DMR: practical and low-overhead permanent fault detection
Proceedings of the 38th annual international symposium on Computer architecture
ROSY: recovering processor and memory systems from hard errors
ACM SIGOPS Operating Systems Review
Feedback control based cache reliability enhancement for emerging multicores
Proceedings of the International Conference on Computer-Aided Design
Assuring application-level correctness against soft errors
Proceedings of the International Conference on Computer-Aided Design
Application-aware diagnosis of runtime hardware faults
Proceedings of the International Conference on Computer-Aided Design
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Encore: low-cost, fine-grained transient fault recovery
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Efficient soft error protection for commodity embedded microprocessors using profile information
Proceedings of the 13th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory for Embedded Systems
Viper: virtual pipelines for enhanced reliability
Proceedings of the 39th Annual International Symposium on Computer Architecture
A defect-tolerant accelerator for emerging high-performance applications
Proceedings of the 39th Annual International Symposium on Computer Architecture
Practical hardening of crash-tolerant systems
USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
Operating system support for redundant multithreading
Proceedings of the tenth ACM international conference on Embedded software
Memory array protection: check on read or check on write?
Proceedings of the Conference on Design, Automation and Test in Europe
FaulTM: error detection and recovery using hardware transactional memory
Proceedings of the Conference on Design, Automation and Test in Europe
Accurate and efficient reliability estimation techniques during ADL-driven embedded processor design
Proceedings of the Conference on Design, Automation and Test in Europe
IVF: characterizing the vulnerability of microprocessor structures to intermittent faults
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
CrashTest'ing SWAT: accurate, gate-level evaluation of symptom-based resiliency solutions
DATE '12 Proceedings of the Conference on Design, Automation and Test in Europe
A survey of checker architectures
ACM Computing Surveys (CSUR)
Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Virtually-aged sampling DMR: unifying circuit failure prediction and circuit failure detection
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
VarEMU: an emulation testbed for variability-aware software
Proceedings of the Ninth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis
Toward predictable, efficient, system-level tolerance of transient faults
ACM SIGBED Review - Special Issue on the 5th Workshop on Adaptive and Reconfigurable Embedded Systems
Hi-index | 0.00 |
With continued CMOS scaling, future shipped hardware will be increasingly vulnerable to in-the-field faults. To be broadly deployable, the hardware reliability solution must incur low overheads, precluding use of expensive redundancy. We explore a cooperative hardware-software solution that watches for anomalous software behavior to indicate the presence of hardware faults. Fundamental to such a solution is a characterization of how hardware faults indifferent microarchitectural structures of a modern processor propagate through the application and OS. This paper aims to provide such a characterization, resulting in identifying low-cost detection methods and providing guidelines for implementation of the recovery and diagnosis components of such a reliability solution. We focus on hard faults because they are increasingly important and have different system implications than the much studied transients. We achieve our goals through fault injection experiments with a microarchitecture-level full system timing simulator. Our main results are: (1) we are able to detect 95% of the unmasked faults in 7 out of 8 studied microarchitectural structures with simple detectors that incur zero to little hardware overhead; (2) over 86% of these detections are within latencies that existing hardware checkpointing schemes can handle, while others require software checkpointing; and (3) a surprisingly large fraction of the detected faults corrupt OS state, but almost all of these are detected with latencies short enough to use hardware checkpointing, thereby enabling OS recovery in virtually all such cases.