Extreme-scale parallel systems will require alternative methods for applications to maintain current levels of uninterrupted execution. Redundant computation is one approach to consider, provided the benefit of increased resiliency outweighs the cost of the additional resources it consumes. We describe a transparent redundancy approach for MPI applications and detail two implementations that tolerate a range of failure scenarios, including the loss of application processes and of connectivity. We compare the two implementations and present micro-benchmark results that bound the worst-case degradation in message-passing performance. We also propose several enhancements that could lower the overhead of providing resiliency through redundancy.
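
To make the idea of transparent redundancy concrete, the following minimal sketch shows one way a redundancy layer might map the ranks an application sees onto replicated physical processes. It is not taken from the paper: the block layout of replicas, the all-to-all fan-out of a logical send, and every function name here are assumptions chosen for illustration, and no real MPI calls are made.

```python
# Hypothetical sketch of rank virtualization in a redundancy layer.
# Assumption: replica k of application rank v runs on physical rank
# v + k * num_virtual (a simple block layout).

def replica_ranks(virtual_rank, num_virtual, degree):
    """Physical ranks that host a copy of one application-visible rank."""
    return [virtual_rank + k * num_virtual for k in range(degree)]

def virtual_rank_of(physical_rank, num_virtual):
    """Application-visible rank that a physical process stands in for."""
    return physical_rank % num_virtual

def fan_out(src_virtual, dst_virtual, num_virtual, degree, alive):
    """Physical (sender, receiver) pairs that carry one logical send.

    Assumption: every surviving sender replica sends to every surviving
    receiver replica; failed processes are skipped.
    """
    senders = [r for r in replica_ranks(src_virtual, num_virtual, degree)
               if r in alive]
    receivers = [r for r in replica_ranks(dst_virtual, num_virtual, degree)
                 if r in alive]
    return [(s, d) for s in senders for d in receivers]

def rank_survives(virtual_rank, num_virtual, degree, alive):
    """A logical rank is lost only when all of its replicas have failed."""
    return any(r in alive
               for r in replica_ranks(virtual_rank, num_virtual, degree))
```

Under these assumptions, an application with 4 visible ranks and a redundancy degree of 2 occupies 8 physical processes, and one logical send becomes up to 4 physical messages. That multiplication is the kind of message-passing overhead the micro-benchmarks are meant to bound.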