Fault tolerance is becoming a major concern in HPC systems. The two traditional approaches for message-passing applications, coordinated checkpointing and message logging, both have severe scalability issues: coordinated checkpointing protocols make all processes roll back after a failure, while message logging protocols log a huge amount of data and can add overhead to communication. Hierarchical rollback-recovery protocols that combine coordinated checkpointing and message logging are an alternative. These partial message logging protocols are based on process clustering: only messages between clusters are logged, which limits the consequences of a failure to a single cluster. They work efficiently only if the application's processes can be grouped into clusters such that the ratio of logged messages is very low. We study the communication patterns of message-passing HPC applications and show that partial message logging is suitable in most cases. We propose a partitioning algorithm that finds suitable clusters of processes given the communication pattern of an application. Finally, we evaluate the efficiency of partial message logging using two state-of-the-art protocols on a set of representative applications.
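
To make the trade-off concrete, the following minimal sketch computes the two quantities a clustering has to balance: the ratio of communication volume that crosses cluster boundaries (and would therefore be logged) and the fraction of processes that roll back when one process fails. It is not the partitioning algorithm proposed in the paper; it only evaluates a clustering that is already given. The function names, the 4-process communication matrix, and the cluster assignment are all hypothetical and chosen for illustration.

```python
def logged_ratio(comm, cluster_of):
    """Fraction of communicated volume that crosses cluster boundaries.

    comm[i][j] is the volume sent from process i to process j (assumed
    format); cluster_of[i] is the cluster id of process i.
    """
    total = 0
    logged = 0
    for src, row in enumerate(comm):
        for dst, volume in enumerate(row):
            if src == dst:
                continue  # ignore self-communication
            total += volume
            if cluster_of[src] != cluster_of[dst]:
                logged += volume  # inter-cluster message: must be logged
    return logged / total if total else 0.0


def rollback_fraction(cluster_of, failed):
    """Fraction of processes that roll back if process `failed` fails,
    assuming only the failed process's cluster rolls back."""
    same = sum(1 for c in cluster_of if c == cluster_of[failed])
    return same / len(cluster_of)


# Hypothetical pattern: processes 0-1 and 2-3 exchange most of the data.
comm = [
    [0, 10, 0, 1],
    [10, 0, 1, 0],
    [0, 1, 0, 10],
    [1, 0, 10, 0],
]
clusters = [0, 0, 1, 1]

print(f"logged ratio:      {logged_ratio(comm, clusters):.3f}")
print(f"rollback fraction: {rollback_fraction(clusters, failed=2):.3f}")
```

The two extremes bracket the design space in this sketch: a single cluster logs nothing but rolls back every process, which matches coordinated checkpointing, while one cluster per process logs every message but rolls back only the failed process, which matches full message logging.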