Dynamic fault tolerance in distributed simulation system

Authors:
Min Ma;Shiyao Jin;Chaoqun Ye;Xiaojian Liu
Affiliations:
School of Computer Science, National University of Defense Technology, Hunan, Changsha, China;School of Computer Science, National University of Defense Technology, Hunan, Changsha, China;School of Computer Science, National University of Defense Technology, Hunan, Changsha, China;School of Computer Science, National University of Defense Technology, Hunan, Changsha, China
Venue:
ICCS'06 Proceedings of the 6th international conference on Computational Science - Volume Part I
Year:
2006

Abstract

Distributed simulation systems are widely used for forecasting, decision-making, and scientific computing. Multi-agent systems and Grids have been used as simulation platforms. To survive software or hardware failures and to guarantee a high success rate during agent migration, the system must solve the fault tolerance problem. Classic fault tolerance techniques such as checkpointing and redundancy can be used in distributed simulation systems, but they are not efficient. We present a novel fault tolerance protocol that combines causal message logging with the primary-backup technique. The proposed protocol uses an iterative backup location scheme and an adaptive update interval to reduce overhead and to balance the cost of fault tolerance against recovery time. The protocol produces no orphan states and does not require the surviving agents to roll back. Most importantly, the recovery scheme can tolerate concurrent failures, even the permanent failure of a single node. The correctness of the protocol is proved, and experiments show that the protocol is efficient.
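
The general idea of pairing message logging with a backup copy can be illustrated with a small sketch. It is not the protocol from the paper, whose details the abstract does not give. It is a simplified illustration that assumes a deterministic agent, a single backup holder, and a fixed update interval. The causal piggybacking of log information on outgoing messages, the iterative backup location scheme, and the adaptive interval are all omitted. The names `Backup`, `Agent`, `recover`, and `update_interval` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Backup:
    """Hypothetical backup holder: last checkpoint plus messages logged since then."""
    checkpoint: int = 0
    log: list = field(default_factory=list)

    def store_checkpoint(self, state: int) -> None:
        self.checkpoint = state
        self.log.clear()  # the checkpoint now covers the previously logged messages

    def append(self, msg: int) -> None:
        self.log.append(msg)


@dataclass
class Agent:
    """Toy deterministic agent: its state is the running sum of received messages."""
    backup: Backup
    update_interval: int = 3  # assumption: refresh the backup every N messages
    state: int = 0
    since_checkpoint: int = 0

    def receive(self, msg: int) -> None:
        self.backup.append(msg)  # log the message at the backup before applying it
        self.state += msg
        self.since_checkpoint += 1
        if self.since_checkpoint >= self.update_interval:
            self.backup.store_checkpoint(self.state)
            self.since_checkpoint = 0


def recover(backup: Backup, update_interval: int = 3) -> Agent:
    """Rebuild a failed agent from the checkpoint, then replay the logged messages."""
    agent = Agent(backup=backup, update_interval=update_interval,
                  state=backup.checkpoint)
    for msg in backup.log:  # replay in the original order, without logging again
        agent.state += msg
    agent.since_checkpoint = len(backup.log)
    return agent


backup = Backup()
agent = Agent(backup=backup, update_interval=3)
for m in [1, 2, 3, 4, 5]:
    agent.receive(m)

recovered = recover(backup)  # simulate losing `agent` and rebuilding it
print(agent.state, recovered.state)
```

In this run the backup is refreshed after the third message, so recovery starts from the checkpointed state 6 and replays messages 4 and 5, reaching the same state (15) as the original agent. No other agent has to roll back. A shorter update interval means less to replay on recovery but more frequent backup updates, which is the kind of trade-off the abstract's adaptive update interval is meant to balance.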