Perturbation-based Fault Screening

Authors:
Paul Racunas;Kypros Constantinides;Srilatha Manne;Shubhendu S. Mukherjee
Affiliations:
FACT Group, Intel Corp., Hudson, MA 01749;Dept. of Computer Science and Engineering, University of Michigan, Ann Arbor, MI 48105;ITPP Group, Intel Corp., DuPont, WA 98327;FACT Group, Intel Corp., Hudson, MA 01749
Venue:
HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
Year:
2007

Abstract

Fault screeners are a new class of fault identification technique that can probabilistically detect whether a transient fault has affected the state of a processor. We demonstrate that fault screeners work because of two key characteristics. First, much of the intermediate data generated by a program inherently falls within certain consistent bounds. Second, these bounds are often violated when a fault is introduced. Fault screeners can therefore identify faults by directly watching for data inconsistencies arising in an application's behavior. We present an idealized algorithm capable of identifying over 85% of injected faults on the SpecInt suite and over 75% overall. Further, in a realistic implementation on a simulated Pentium-III-like processor, about half of the errors due to injected faults are identified while still in speculative state. Errors detected this early can be eliminated by a pipeline flush. We present several hardware-based versions of this screening algorithm and show that flushing the pipeline every time the hardware screener triggers reduces overall performance by less than 1%.
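
The two observations in the abstract (values stay within consistent bounds, and faults tend to break those bounds) can be illustrated with a minimal software sketch. This is not the paper's idealized algorithm or any of its hardware screeners. It is a simple range check written for illustration, and the names RangeScreener, slack, flip_bit, and demo are hypothetical. The sketch assumes one screener per static instruction, learns the minimum and maximum value seen so far, and triggers when a new value falls outside that range by more than a tolerance.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class RangeScreener:
    """Hypothetical screener: tracks the [lo, hi] range of values seen so far."""
    lo: Optional[int] = None
    hi: Optional[int] = None
    slack: int = 0  # assumed tolerance before a value counts as out of bounds

    def observe(self, value: int) -> bool:
        """Return True if value lies outside the learned bounds, then widen them."""
        if self.lo is None or self.hi is None:
            # First value seen: nothing to compare against yet.
            self.lo = self.hi = value
            return False
        triggered = value < self.lo - self.slack or value > self.hi + self.slack
        # Always adapt, so a legitimate change in behavior triggers only once.
        self.lo = min(self.lo, value)
        self.hi = max(self.hi, value)
        return triggered


def flip_bit(value: int, bit: int) -> int:
    """Model a transient fault as a single bit flip in a result value."""
    return value ^ (1 << bit)


def demo() -> Tuple[bool, bool]:
    screener = RangeScreener(slack=16)
    # Fault-free behavior: an address-like value that advances by 4.
    for v in range(1000, 1100, 4):
        screener.observe(v)
    clean = screener.observe(1104)                 # next fault-free value
    faulty = screener.observe(flip_bit(1108, 20))  # high-order bit flipped
    return clean, faulty
```

In this sketch, demo() returns (False, True): the fault-free value stays within the learned bounds, while the flipped high-order bit pushes the value far outside them. A flip in a low-order bit would stay inside the bounds and go unnoticed, which is one reason detection by a screener of this kind is probabilistic rather than guaranteed.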