Distributed snapshots: determining global states of distributed systems
ACM Transactions on Computer Systems (TOCS)
Cilk: an efficient multithreaded runtime system
Journal of Parallel and Distributed Computing - Special issue on multithreading for multiprocessors
Processor allocation in multiprogrammed distributed-memory parallel computer systems
Journal of Parallel and Distributed Computing
A survey of rollback-recovery protocols in message-passing systems
ACM Computing Surveys (CSUR)
Athapascan-1: On-Line Building Data Flow Graph in a Parallel Language
PACT '98 Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques
FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI
CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing
Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand
ICPP '06 Proceedings of the 2006 International Conference on Parallel Processing
ICPPW '06 Proceedings of the 2006 International Conference Workshops on Parallel Processing
KAAPI: A thread scheduling runtime system for data flow computations on cluster of multi-processors
Proceedings of the 2007 international workshop on Parallel symbolic computation
Flexible Rollback Recovery in Dynamic Heterogeneous Grid Computing
IEEE Transactions on Dependable and Secure Computing
Hybrid MPI/OpenMP Parallel Programming on Clusters of Multi-Core SMP Nodes
PDP '09 Proceedings of the 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing
Dynamic task scheduling for linear algebra algorithms on distributed-memory multicore systems
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Parallelizing dense and banded linear algebra libraries using SMPSs
Concurrency and Computation: Practice & Experience
Unifying UPC and MPI runtimes: experience with MVAPICH
Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model
Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic MPI Applications
IPDPS '11 Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium
Hi-index | 0.00 |
Failure free execution will become rare in the future exascale computers. Thus, fault tolerance is now an active field of research. In this paper, we study the impact of decomposing an application in much more parallelism that the physical parallelism on the rollback step of fault tolerant coordinated protocols. This over-decomposition gives the runtime a better opportunity to balance workload after failure without the need of spare nodes, while preserving performance. We show that the overhead on normal execution remains low for relevant factor of over-decomposition. With over-decomposition, restart execution on the remaining nodes after failures shows very good performance compared to classic decomposition approach: our experiments show that the execution time after restart can be reduced by 42 %. We also consider a partial restart protocol to reduce the amount of lost work in case of failure by tracking the task dependencies inside processes. In some cases and thanks to over-decomposition, this partial restart time can represent only 54 % of the global restart time.