Impact of over-decomposition on coordinated checkpoint/rollback protocol

Authors:
Xavier Besseron; Thierry Gautier
Affiliations:
Dept. of Computer Science and Engineering, The Ohio State University; MOAIS Project, INRIA, France
Venue:
Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
Year:
2011

Abstract

Failure-free execution will become rare on future exascale computers, so fault tolerance is now an active field of research. In this paper, we study how decomposing an application into much more parallelism than the physical parallelism affects the rollback step of fault-tolerant coordinated protocols. This over-decomposition gives the runtime a better opportunity to balance the workload after a failure without the need for spare nodes, while preserving performance. We show that the overhead on normal execution remains low for relevant over-decomposition factors. With over-decomposition, restarting execution on the remaining nodes after failures performs very well compared to the classic decomposition approach: our experiments show that the execution time after restart can be reduced by 42%. We also consider a partial restart protocol that reduces the amount of work lost in case of failure by tracking the task dependencies inside processes. In some cases, thanks to over-decomposition, this partial restart time is only 54% of the global restart time.
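
To illustrate why over-decomposition helps rebalancing after a failure, the following is a minimal sketch under an idealized model, not the protocol or runtime evaluated in the paper. It assumes equal-cost, independent tasks and ignores communication, checkpoint, and scheduling overheads; the function name `restart_time` and its parameters are hypothetical, and its outputs are not the paper's measured results.

```python
def restart_time(nodes: int, factor: int, failed: int = 1) -> float:
    """Idealized time to re-execute one step on the surviving nodes.

    Assumptions (illustrative only): the step costs 1 time unit per
    original node, the work is split into nodes * factor equal tasks,
    and tasks can be placed freely on any surviving node.
    """
    survivors = nodes - failed
    if survivors <= 0:
        raise ValueError("no surviving nodes")
    tasks = nodes * factor
    # The most loaded survivor receives ceil(tasks / survivors) tasks.
    max_tasks = -(-tasks // survivors)
    # Each task costs 1 / factor time units.
    return max_tasks / factor


if __name__ == "__main__":
    nodes = 8
    for factor in (1, 2, 4, 8):
        print(f"factor={factor}: restart time = {restart_time(nodes, factor):.3f}")
    print(f"ideal balance: {nodes / (nodes - 1):.3f}")
```

In this toy model, a classic decomposition (factor 1) forces one surviving node to take two whole tasks, so the step takes twice as long, whereas larger factors let the load approach the ideal of nodes / (nodes - 1) without spare nodes.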