Efficient commit protocols for the tree of processes model of distributed transactions
ACM SIGOPS Operating Systems Review
Fault Tolerance in Message Passing Interface Programs
International Journal of High Performance Computing Applications
Algorithm-Based Fault Tolerance for Matrix Operations
IEEE Transactions on Computers
International Journal of High Performance Computing Applications
A log-scaling fault tolerant agreement algorithm for a fault tolerant MPI
EuroMPI'11 Proceedings of the 18th European MPI Users' Group conference on Recent advances in the message passing interface
An evaluation of user-level failure mitigation support in MPI
EuroMPI'12 Proceedings of the 19th European conference on Recent Advances in the Message Passing Interface
In a constant effort to deliver steady performance improvements, the size of High Performance Computing (HPC) systems, as observed by the Top 500 ranking, has grown tremendously over the last decade. This trend, along with the resultant decrease of the Mean Time Between Failures (MTBF), is unlikely to stop; consequently, many computing nodes will inevitably fail during application execution [5]. It is alarming that most popular fault-tolerant approaches see their efficiency plummet at exascale [3,4], calling for more efficient approaches revolving around application-centric failure mitigation strategies [7].