Application-Level Correctness and its Impact on Fault Tolerance

Authors:
Xuanhua Li;Donald Yeung
Affiliations:
Department of Electrical and Computer Engineering, University of Maryland, College Park. xli@eng.umd.edu;Department of Electrical and Computer Engineering, University of Maryland, College Park. yeung@eng.umd.edu
Venue:
HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
Year:
2007

Abstract

Traditionally, fault tolerance researchers have required architectural state to be numerically perfect for program execution to be considered correct. However, in many programs, execution that is not 100% numerically correct can still appear correct from the user's perspective. Hence, whether a fault is unacceptable or benign may depend on the level of abstraction at which correctness is evaluated: more faults are benign at higher levels of abstraction (i.e., the user or application level) than at lower levels (i.e., the architecture level). The extent to which programs are more fault resilient at higher levels of abstraction is application dependent. Programs that produce inexact and/or approximate outputs can be very resilient at the application level. We call such programs soft computations, and we find they are common in multimedia and artificial intelligence (AI) workloads. Programs that compute exact numerical outputs offer less error resilience at the application level. However, we find that all programs studied in this paper exhibit some enhanced fault resilience at the application level, including those traditionally considered exact computations, e.g., SPECInt CPU2000. This paper investigates definitions of program correctness that view correctness from the application's standpoint rather than the architecture's standpoint. Under application-level correctness, a program's execution is deemed correct as long as the result it produces is acceptable to the user. To quantify user satisfaction, we rely on application-level fidelity metrics that capture user-perceived program solution quality.
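The distinction between the two notions of correctness can be illustrated with a small sketch. It is not taken from the paper: the choice of PSNR as the fidelity metric, the 30 dB acceptance threshold, the sample values, and all function names are assumptions made purely for illustration. The sketch compares a fault-free ("golden") output against outputs corrupted by a single bit flip, first with an exact comparison (architecture level) and then with a fidelity threshold (application level).

```python
import math

def architecturally_correct(golden, observed):
    """Architecture-level check: every output value must match exactly."""
    return list(golden) == list(observed)

def psnr(golden, observed, peak=255.0):
    """Peak signal-to-noise ratio in dB (an assumed fidelity metric)."""
    if len(golden) != len(observed) or not golden:
        raise ValueError("outputs must be non-empty and of equal length")
    mse = sum((g - o) ** 2 for g, o in zip(golden, observed)) / len(golden)
    if mse == 0:
        return math.inf  # identical outputs
    return 10.0 * math.log10(peak ** 2 / mse)

def application_correct(golden, observed, min_psnr_db=30.0):
    """Application-level check: output is acceptable if fidelity is high enough.

    The 30 dB threshold is a hypothetical value, not one from the paper.
    """
    return psnr(golden, observed) >= min_psnr_db

# Hypothetical 8-bit output samples from a fault-free run.
golden = [10, 200, 37, 90, 128, 255, 0, 64]

# Fault 1: flip the least significant bit of one sample (37 -> 36).
minor_fault = list(golden)
minor_fault[2] ^= 0x01

# Fault 2: flip the most significant bit of one sample (255 -> 127).
major_fault = list(golden)
major_fault[5] ^= 0x80

for name, out in [("minor", minor_fault), ("major", major_fault)]:
    print(name,
          "arch-correct:", architecturally_correct(golden, out),
          "app-correct:", application_correct(golden, out))
```

Under these assumptions, both faults fail the exact comparison, but only the high-order bit flip degrades fidelity enough to fail the application-level check. The low-order flip would be classified as benign at the application level.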