On the use of cluster-based partial message logging to improve fault tolerance for MPI HPC applications

Authors:
Thomas Ropars;Amina Guermouche;Bora Uçar;Esteban Meneses;Laxmikant V. Kalé;Franck Cappello
Affiliations:
INRIA Saclay-Île de France, France;INRIA Saclay-Île de France, France and Université Paris-Sud;CNRS and ENS Lyon, France;University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign;INRIA Saclay-Île de France, France and University of Illinois at Urbana-Champaign
Venue:
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
Year:
2011

Abstract

Fault tolerance is becoming a major concern in HPC systems. The two traditional approaches for message-passing applications, coordinated checkpointing and message logging, have severe scalability issues: coordinated checkpointing protocols make all processes roll back after a failure, while message logging protocols log a huge amount of data and can induce an overhead on communication performance. Hierarchical rollback-recovery protocols, which combine coordinated checkpointing and message logging, are an alternative. These partial message logging protocols are based on process clustering: only messages between clusters are logged, which limits the consequences of a failure to one cluster. They work efficiently only if one can find clusters of processes in the application such that the ratio of logged messages is very low. We study the communication patterns of message-passing HPC applications to show that partial message logging is suitable in most cases. We propose a partitioning algorithm to find suitable clusters of processes given the communication pattern of an application. Finally, we evaluate the efficiency of partial message logging using two state-of-the-art protocols on a set of representative applications.
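
To make the trade-off in the abstract concrete, the following minimal sketch computes the two quantities a clustering is judged by: the ratio of logged (inter-cluster) traffic and the set of processes that roll back when one process fails. It is an illustration only, not the partitioning algorithm or either protocol from the paper. The names `logged_ratio`, `rollback_set`, `traffic`, and `clusters` are hypothetical, and the 4-process communication pattern is made-up data.

```python
from typing import Dict, Set, Tuple

# Assumption: traffic maps an ordered (sender, receiver) pair of process
# ranks to the number of bytes sent; cluster_of maps a rank to a cluster id.
Traffic = Dict[Tuple[int, int], int]


def logged_ratio(traffic: Traffic, cluster_of: Dict[int, int]) -> float:
    """Fraction of the traffic volume that crosses cluster boundaries.

    Under partial message logging, only these inter-cluster messages
    are logged.
    """
    total = sum(traffic.values())
    if total == 0:
        return 0.0
    logged = sum(
        volume
        for (src, dst), volume in traffic.items()
        if cluster_of[src] != cluster_of[dst]
    )
    return logged / total


def rollback_set(failed: int, cluster_of: Dict[int, int]) -> Set[int]:
    """Processes that roll back after `failed` crashes: its whole cluster."""
    target = cluster_of[failed]
    return {rank for rank, cluster in cluster_of.items() if cluster == target}


# Hypothetical pattern: heavy traffic inside pairs {0, 1} and {2, 3},
# light traffic between ranks 1 and 2.
traffic = {
    (0, 1): 100, (1, 0): 100,
    (2, 3): 100, (3, 2): 100,
    (1, 2): 10, (2, 1): 10,
}
clusters = {0: 0, 1: 0, 2: 1, 3: 1}

ratio = logged_ratio(traffic, clusters)   # 20 of 420 bytes cross clusters
affected = rollback_set(3, clusters)      # only ranks 2 and 3 roll back
```

The two extremes correspond to the traditional approaches: a single cluster holding every process logs nothing but rolls everyone back (coordinated checkpointing), while one cluster per process logs all traffic but rolls back only the failed process (full message logging).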