Extending the concept of component interfaces: experience with the Integrated Plasma Simulator
Proceedings of the 2009 Workshop on Component-Based High Performance Computing
Selective Recovery from Failures in a Task Parallel Programming Model
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Towards production monitoring of application progress
Proceedings of the 4th International Workshop on Software Engineering for Computational Science and Engineering
Algorithm-based recovery for iterative methods without checkpointing
Proceedings of the 20th international symposium on High performance distributed computing
Tolerating correlated failures for generalized Cartesian distributions via bipartite matching
Proceedings of the 8th ACM International Conference on Computing Frontiers
A case for virtual machine based fault injection in a high-performance computing environment
Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing
Can checkpoint/restart mechanisms benefit from hierarchical data staging?
Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
ICCSA'12 Proceedings of the 12th international conference on Computational Science and Its Applications - Volume Part IV
Hi-index | 0.00 |
Considerable work has been done on providing fault tolerance capabilities for different software components on large-scale high-end computing systems. Thus far, however, these fault-tolerant components have worked insularly and independently and information about faults is rarely shared. Such lack of system-wide fault tolerance is emerging as one of the biggest problems on leadership-class systems. In this paper, we propose a coordinated infrastructure, named CIFTS, that enables system software components to share fault information with each other and adapt to faults in a holistic manner. Central to the CIFTS infrastructure is a Fault Tolerance Backplane (FTB) that enables fault notification and awareness throughout the software stack, including fault-aware libraries, middleware, and applications. We present details of the CIFTS infrastructure and the interface specification that has allowed various software programs, including MPICH2, MVAPICH, Open MPI, and PVFS, to plug into the CIFTS infrastructure. Further, through a detailed evaluation we demonstrate the nonintrusive low-overhead capability of CIFTS that lets applications run with minimal performance degradation.