Fault tolerance in the mobile environment

Authors:
Daniel C. Doolan;Sabin Tabirca;Laurence T. Yang
Affiliations:
School of Computing, Robert Gordon University, Aberdeen, United Kingdom;Department of Computer Science, University College Cork, Cork, Ireland;Department of Computer Science, St. Francis Xavier University, Antigonish, NS, Canada
Venue:
Journal of Mobile Multimedia
Year:
2009

Abstract

In general, it is assumed that a parallel program will execute on reliable hardware. A fault-tolerant program and its underlying infrastructure should be capable of surviving failures such as system crashes and network failures. At the highest level, the application should be capable of automatically recovering from a set of faults without any change to the apparent behaviour of the program. Checkpointing may be used to allow a program to save its state to persistent storage, abort, and later restart from the checkpoint. Several fault-tolerant MPI implementations exist; MPICH-V, for example, is considered to be one of the most complete, featuring checkpointing and message logs that allow aborted processes to be replaced. No matter how sophisticated a fault-tolerant system may be, it can never be completely relied upon, as there is always the possibility of a complete system failure. It is one thing to develop fault-tolerant applications on high-end dedicated clusters and supercomputers; applying fault tolerance to mobile parallel computing, however, presents an entirely new series of challenges that are inextricably linked with the unpredictable nature of wireless communication systems. Two differing strategies for fault tolerance in the mobile Bluetooth wireless environment are presented and compared to determine which should be preferred.
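
The checkpoint-and-restart idea mentioned in the abstract can be illustrated with a minimal single-process sketch. This is not the paper's method and not how MPICH-V works; it only shows the general pattern of saving state to persistent storage and resuming from it after a failure. All names here (`save_checkpoint`, `load_checkpoint`, `run`, the `fail_at` parameter, the JSON file layout) are hypothetical and chosen for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Persist state by writing a temp file and renaming it over the target,
    so a crash during the write does not leave a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    if not os.path.exists(path):
        return {"next_i": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run(path, n, every=10, fail_at=None):
    """Sum the integers 0..n-1, checkpointing every `every` iterations.

    `fail_at` is a hypothetical hook that simulates a crash at iteration i.
    """
    state = load_checkpoint(path)
    for i in range(state["next_i"], n):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated crash")
        state["total"] += i
        state["next_i"] = i + 1
        if state["next_i"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

Work done after the most recent checkpoint is lost on a crash and is recomputed on restart, which is the basic trade-off between checkpoint frequency and recovery cost. A distributed setting such as MPI additionally has to deal with in-flight messages, which is where message logging comes in; this sketch does not attempt that.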