Understanding the propagation of hard errors to software and implications for resilient system design

Authors:
Man-Lap Li;Pradeep Ramachandran;Swarup Kumar Sahoo;Sarita V. Adve;Vikram S. Adve;Yuanyuan Zhou
Affiliations:
University of Illinois at Urbana-Champaign, Urbana, IL (all authors)
Venue:
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Year:
2008

Abstract

With continued CMOS scaling, future shipped hardware will be increasingly vulnerable to in-the-field faults. To be broadly deployable, the hardware reliability solution must incur low overheads, precluding use of expensive redundancy. We explore a cooperative hardware-software solution that watches for anomalous software behavior to indicate the presence of hardware faults. Fundamental to such a solution is a characterization of how hardware faults in different microarchitectural structures of a modern processor propagate through the application and OS. This paper aims to provide such a characterization, resulting in identifying low-cost detection methods and providing guidelines for implementation of the recovery and diagnosis components of such a reliability solution. We focus on hard faults because they are increasingly important and have different system implications than the much-studied transients. We achieve our goals through fault injection experiments with a microarchitecture-level full system timing simulator. Our main results are: (1) we are able to detect 95% of the unmasked faults in 7 out of 8 studied microarchitectural structures with simple detectors that incur zero to little hardware overhead; (2) over 86% of these detections are within latencies that existing hardware checkpointing schemes can handle, while others require software checkpointing; and (3) a surprisingly large fraction of the detected faults corrupt OS state, but almost all of these are detected with latencies short enough to use hardware checkpointing, thereby enabling OS recovery in virtually all such cases.
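
To make the terms in the abstract concrete, the following is a minimal Python sketch, not the authors' simulator or methodology. It assumes a stuck-at bit model for a hard fault (the abstract does not name a fault model) and a hypothetical checkpoint window measured in instructions. The names `StuckAtFault`, `classify_latency`, and `hw_window` are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StuckAtFault:
    """Assumed hard-fault model: one bit of a word is permanently forced to a value."""
    bit: int          # bit position, 0 = least significant
    stuck_value: int  # 0 or 1

    def apply(self, word: int) -> int:
        """Return the word as seen through the faulty bit (non-negative words only)."""
        mask = 1 << self.bit
        return word | mask if self.stuck_value else word & ~mask


def classify_latency(latency_instructions: int, hw_window: int) -> str:
    """Bucket a detection by whether it fits an assumed hardware checkpoint window."""
    if latency_instructions <= hw_window:
        return "hardware-checkpoint"
    return "software-checkpoint"


if __name__ == "__main__":
    fault = StuckAtFault(bit=0, stuck_value=1)
    values = [4, 5, 6]
    # Unlike a transient, the hard fault affects every value that uses the bit.
    seen = [fault.apply(v) for v in values]
    # A fault is masked when the bit already holds the stuck value.
    masked = sum(1 for v, s in zip(values, seen) if v == s)
    print(seen, "masked:", masked)
    # hw_window is a hypothetical figure chosen only for this illustration.
    print(classify_latency(100, hw_window=1000))
```

The sketch only illustrates two ideas from the abstract: a permanent fault persists across uses but can still be masked, and each detection is sorted by latency into recovery that hardware checkpointing could cover versus recovery that needs software checkpointing.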