Self-refined fault tolerance in HPC using dynamic dependent process groups

Authors:
N. P. Gopalan;K. Nagarajan
Affiliations:
Department of Computer Science and Engineering, National Institute of Technology, Tiruchirappalli, TN, India;Department of Computer Science and Engineering, National Institute of Technology, Tiruchirappalli, TN, India
Venue:
IWDC'05 Proceedings of the 7th international conference on Distributed Computing
Year:
2005



Abstract

This paper proposes a novel method for achieving distributed, self-refined fault tolerance by dynamically partitioning the processes into smaller groups that are mutually disjoint and collectively exhaustive of the whole system. The model tolerates frequent faults and makes rollback recovery simpler and less time-consuming. An optimal checkpoint interval is found using a mathematical approximation, and a spare process is made to capture all in-transit messages when a process fails at its ends. Process grouping is done by piggybacking the events of dependent processes on outgoing messages. The process holding the most information can scatter chunk values to the other dependent processes in its group. Each process constructs a checkpoint when the received chunk matches its log.
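
The grouping idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not give the grouping rule, so this assumes two processes are dependent once they have exchanged a message, and that each message carries the ids of processes the sender already depends on. The class name `DependencyGroups` and the method `record_message` are hypothetical. A union-find structure keeps the groups mutually disjoint and collectively exhaustive by construction.

```python
from collections import defaultdict


class DependencyGroups:
    """Union-find over process ids; each set is one dependent process group.

    Assumption (not from the paper): a message makes the sender, the
    receiver, and every piggybacked process id members of the same group.
    """

    def __init__(self, process_ids):
        # Every process starts alone, so the partition covers the whole system.
        self.parent = {p: p for p in process_ids}

    def find(self, p):
        # Walk to the group representative, halving the path on the way.
        while self.parent[p] != p:
            self.parent[p] = self.parent[self.parent[p]]
            p = self.parent[p]
        return p

    def record_message(self, sender, receiver, piggybacked=()):
        # Merge the sender's group and any piggybacked dependencies
        # into the receiver's group.
        root = self.find(receiver)
        for p in (sender, *piggybacked):
            self.parent[self.find(p)] = root

    def groups(self):
        # Return the current partition as sorted lists of process ids.
        out = defaultdict(set)
        for p in self.parent:
            out[self.find(p)].add(p)
        return sorted(sorted(g) for g in out.values())


g = DependencyGroups(range(6))
g.record_message(0, 1)
g.record_message(2, 3, piggybacked=(4,))
print(g.groups())
```

This sketch only merges groups. How the paper refines or dissolves groups over time (for example, after a checkpoint) is not stated in the abstract and is left out here.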