Performance evaluation of consistent recovery protocols using MPICH-GF

Authors:
Namyoon Woo;Hyungsoo Jung;Dongin Shin;Hyuck Han;Heon Y. Yeom;Taesoon Park
Affiliations:
School of Computer Science and Engineering, Seoul National University, Seoul, Korea;School of Computer Science and Engineering, Seoul National University, Seoul, Korea;School of Computer Science and Engineering, Seoul National University, Seoul, Korea;School of Computer Science and Engineering, Seoul National University, Seoul, Korea;School of Computer Science and Engineering, Seoul National University, Seoul, Korea;Department of Computer Engineering, Sejong University, Seoul, Korea
Venue:
EDCC'05 Proceedings of the 5th European conference on Dependable Computing
Year:
2005

Abstract

This paper presents an implementation of several consistent recovery protocols at the abstract device level and compares their performance. We performed experiments using three NAS Parallel Benchmark applications with class C datasets on state-of-the-art equipment. The interesting result is that the causal message logging protocol has the highest recovery cost for communication-intensive applications, since it suffers from a concentrated overload of simultaneous message replaying. Receiver-based optimistic message logging has the lowest recovery cost, with the drawback of extensive disk access overhead in failure-free executions. Coordinated checkpointing seems the most practical choice among them.