A model of roll-back recovery with multiple checkpoints
ICSE '76 Proceedings of the 2nd international conference on Software engineering
A Case of Multi-Level Distributed Recovery Schemes
A Case of Multi-Level Distributed Recovery Schemes
A Highly-Efficient Technique for Reducing Soft Errors in Static CMOS Circuits
ICCD '04 Proceedings of the IEEE International Conference on Computer Design
Assessing Fault Sensitivity in MPI Applications
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Parallel Processing for Scientific Computing (Software, Environments and Tools)
Parallel Processing for Scientific Computing (Software, Environments and Tools)
Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance
DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
PIN: a binary instrumentation tool for computer architecture research and education
WCAE '04 Proceedings of the 2004 workshop on Computer architecture education: held in conjunction with the 31st International Symposium on Computer Architecture
Algorithm-Based Fault Tolerance for Matrix Operations
IEEE Transactions on Computers
Soft error vulnerability of iterative linear algebra methods
Proceedings of the 22nd annual international conference on Supercomputing
Fault injection framework for system resilience evaluation: fake faults for finding future failures
Proceedings of the 2009 workshop on Resiliency in high performance
Architecting phase change memory as a scalable dram alternative
Proceedings of the 36th annual international symposium on Computer architecture
Best-effort parallel execution framework for Recognition and mining applications
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
PDRAM: a hybrid PRAM and DRAM main memory system
Proceedings of the 46th Annual Design Automation Conference
International Journal of High Performance Computing Applications
Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
A realistic evaluation of memory hardware errors and software system susceptibility
USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Hybrid Checkpointing for MPI Jobs in HPC Environments
ICPADS '10 Proceedings of the 2010 IEEE 16th International Conference on Parallel and Distributed Systems
A Skeletal-Based Approach for the Development of Fault-Tolerant SPMD Applications
PDCAT '10 Proceedings of the 2010 International Conference on Parallel and Distributed Computing, Applications and Technologies
Page placement in hybrid memory systems
Proceedings of the international conference on Supercomputing
Characterizing the impact of soft errors on iterative methods in scientific computing
Proceedings of the international conference on Supercomputing
Algorithm-based recovery for iterative methods without checkpointing
Proceedings of the 20th international symposium on High performance distributed computing
Matrix Multiplication on GPUs with On-Line Fault Tolerance
ISPA '11 Proceedings of the 2011 IEEE Ninth International Symposium on Parallel and Distributed Processing with Applications
Evaluating the viability of process replication reliability for exascale systems
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Exploring the vulnerability of CMPs to soft errors with 3D stacked non-volatile memory
ICCD '11 Proceedings of the 2011 IEEE 29th International Conference on Computer Design
Algorithm-based fault tolerance for dense matrix factorizations
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
Combining Partial Redundancy and Checkpointing for HPC
ICDCS '12 Proceedings of the 2012 IEEE 32nd International Conference on Distributed Computing Systems
Using unreliable virtual hardware to inject errors in extreme-scale systems
Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale
Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Detecting silent data corruption through data dynamic monitoring for scientific applications
Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
Hi-index | 0.00 |
Extreme-scale scientific applications are at a significant risk of being hit by soft errors on supercomputers as the scale of these systems and the component density continues to increase. In order to better understand the specific soft error vulnerabilities in scientific applications, we have built an empirical fault injection and consequence analysis tool - BIFIT - that allows us to evaluate how soft errors impact applications. In particular, BIFIT is designed with capability to inject faults at very specific targets: an arbitrarily-chosen execution point and any specific data structure. We apply BIFIT to three mission-critical scientific applications and investigate the applications vulnerability to soft errors by performing thousands of statistical tests. We, then, classify each applications individual data structures based on their sensitivity to these vulnerabilities, and generalize these classifications across applications. Subsequently, these classifications can be used to apply appropriate resiliency solutions to each data structure within an application. Our study reveals that these scientific applications have a wide range of sensitivities to both the time and the location of a soft error; yet, we are able to identify intrinsic relationships between application vulnerabilities and specific types of data objects. In this regard, BIFIT enables new opportunities for future resiliency research.