Probabilistic accuracy bounds for fault-tolerant computations that discard tasks

  • Authors: Martin Rinard

  • Affiliation: Massachusetts Institute of Technology, Cambridge, MA

  • Venue: Proceedings of the 20th annual international conference on Supercomputing
  • Year: 2006

Abstract

We present a new technique for enabling computations to survive errors and faults while providing a bound on any resulting output distortion. A developer using the technique first partitions the computation into tasks. The execution platform then simply discards any task that encounters an error or a fault and completes the computation by executing the remaining tasks. This technique can substantially improve the robustness of the computation in the face of errors and faults. A potential concern is that discarding tasks may change the result that the computation produces.

Our technique randomly samples executions of the program at varying task failure rates to obtain a quantitative, probabilistic model that characterizes the distortion of the output as a function of the task failure rates. By providing probabilistic bounds on the distortion, the model allows users to confidently accept results produced by executions with failures as long as the distortion falls within acceptable bounds. This approach may prove especially useful for enabling computations to survive hardware failures in distributed computing environments.

Our technique also produces a timing model that characterizes the execution time as a function of the task failure rates. Together, the distortion and timing models quantify an accuracy/execution time tradeoff. This tradeoff enables the development of techniques that purposefully fail tasks to reduce the execution time while keeping the distortion within acceptable bounds.
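
The following sketch illustrates, under simplifying assumptions, how such distortion and timing models might be sampled empirically: a toy task-based computation is run repeatedly at several task failure rates, failed tasks are simply discarded, and the observed relative distortions and execution times are summarized per rate. It is not the paper's implementation; the function and parameter names (run_with_failures, sample_models, samples_per_rate, the toy task list) are illustrative assumptions.

    # Hypothetical sketch, not the paper's implementation: sample a task-based
    # computation at several task failure rates and summarize the resulting
    # output distortion and execution time for each rate.
    import random
    import statistics
    import time

    def reference_result(tasks):
        # Failure-free execution provides the baseline output.
        return sum(task() for task in tasks)

    def run_with_failures(tasks, failure_rate, rng):
        # Execute the tasks, silently discarding any task that "fails".
        total = 0.0
        for task in tasks:
            if rng.random() < failure_rate:
                continue  # discard the failed task and keep going
            total += task()
        return total

    def sample_models(tasks, failure_rates, samples_per_rate=100, seed=0):
        # For each failure rate, estimate the mean and 95th-percentile distortion
        # (relative error against the failure-free result) and the mean run time.
        rng = random.Random(seed)
        baseline = reference_result(tasks)
        model = {}
        for rate in failure_rates:
            distortions, times = [], []
            for _ in range(samples_per_rate):
                start = time.perf_counter()
                result = run_with_failures(tasks, rate, rng)
                times.append(time.perf_counter() - start)
                distortions.append(abs(result - baseline) / max(abs(baseline), 1e-12))
            distortions.sort()
            model[rate] = {
                "mean_distortion": statistics.mean(distortions),
                "p95_distortion": distortions[int(0.95 * (len(distortions) - 1))],
                "mean_time": statistics.mean(times),
            }
        return model

    if __name__ == "__main__":
        # Toy computation: 1000 independent tasks, each contributing one term to a sum.
        tasks = [lambda i=i: (i % 7) * 0.001 for i in range(1000)]
        for rate, stats in sample_models(tasks, [0.0, 0.01, 0.05, 0.10]).items():
            print(rate, stats)

In this toy setting the accuracy/execution time tradeoff shows up directly in the sampled summaries: higher failure rates execute fewer tasks and so finish faster, while the 95th-percentile distortion grows, so one could select the largest failure rate whose distortion bound remains acceptable.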