Classifying soft error vulnerabilities in extreme-scale scientific applications using a binary instrumentation tool

Authors:
Dong Li; Jeffrey S. Vetter; Weikuan Yu
Affiliations:
Oak Ridge National Laboratory; Oak Ridge National Laboratory, Georgia Institute of Technology; Auburn University
Venue:
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2012


Abstract

Extreme-scale scientific applications are at significant risk of being hit by soft errors on supercomputers as the scale of these systems and their component density continue to increase. To better understand the specific soft error vulnerabilities in scientific applications, we have built an empirical fault injection and consequence analysis tool, BIFIT, that allows us to evaluate how soft errors impact applications. In particular, BIFIT is designed with the capability to inject faults at very specific targets: an arbitrarily chosen execution point and any specific data structure. We apply BIFIT to three mission-critical scientific applications and investigate the applications' vulnerability to soft errors by performing thousands of statistical tests. We then classify each application's individual data structures based on their sensitivity to these vulnerabilities, and generalize these classifications across applications. Subsequently, these classifications can be used to apply appropriate resiliency solutions to each data structure within an application. Our study reveals that these scientific applications have a wide range of sensitivities to both the time and the location of a soft error; yet, we are able to identify intrinsic relationships between application vulnerabilities and specific types of data objects. In this regard, BIFIT enables new opportunities for future resiliency research.
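
The idea of injecting a fault at a chosen execution point into a chosen data structure, then classifying the consequence, can be illustrated with a small sketch. This is not BIFIT and does not instrument binaries; it is a toy Python model under stated assumptions. The kernel (a fixed-point iteration x = 0.5x + b), the structure names `x` and `b`, the single-bit-flip fault model, and the outcome labels `benign`, `sdc` (silent data corruption), and `abnormal` are all hypothetical choices made for illustration, not the paper's method or categories.

```python
import math
import random
import struct
from collections import Counter

B = [1.0, 2.0, 3.0, 4.0]  # hypothetical input vector for the toy kernel


def flip_bit(value, bit):
    """Flip one bit (0..63) in the IEEE-754 double encoding of `value`."""
    raw = struct.unpack("<Q", struct.pack("<d", value))[0]
    return struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))[0]


def run_kernel(steps=50, fault=None):
    """Iterate x = 0.5*x + b; `fault` is (step, structure, index, bit) or None."""
    data = {"x": [0.0] * len(B), "b": list(B)}  # fresh copies for every run
    for t in range(steps):
        if fault is not None and fault[0] == t:
            _, name, index, bit = fault
            # Inject at the chosen execution point, into the chosen structure.
            data[name][index] = flip_bit(data[name][index], bit)
        data["x"] = [0.5 * xi + bi for xi, bi in zip(data["x"], data["b"])]
    return data["x"]


def classify(result, golden, rel_tol=1e-9):
    """Compare a faulty run against the fault-free (golden) run."""
    if not all(math.isfinite(v) for v in result):
        return "abnormal"  # NaN or infinity reached the output
    if all(math.isclose(r, g, rel_tol=rel_tol) for r, g in zip(result, golden)):
        return "benign"    # the fault was masked
    return "sdc"           # finite but wrong output


def campaign(trials=1000, steps=50, seed=0):
    """Run random single-fault trials and tally outcomes per data structure."""
    rng = random.Random(seed)
    golden = run_kernel(steps)
    tally = {"x": Counter(), "b": Counter()}
    for _ in range(trials):
        name = rng.choice(["x", "b"])
        fault = (rng.randrange(steps), name, rng.randrange(len(B)), rng.randrange(64))
        tally[name][classify(run_kernel(steps, fault), golden)] += 1
    return tally


if __name__ == "__main__":
    for name, counts in campaign().items():
        print(name, dict(counts))
```

In this toy model the iterate `x` is overwritten at every step, so many early faults in it are damped out, whereas a fault in the read-only input `b` persists until the end of the run. That contrast is only meant to suggest why sensitivity can depend on both the time of the fault and the data structure it hits.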