Distributed simulation systems are widely used for forecasting, decision-making, and scientific computing, and multi-agent systems and Grids have been used as simulation platforms. To survive software or hardware failures and to guarantee the success rate of agent migration, such a system must solve the fault-tolerance problem. Classic fault-tolerance techniques such as checkpointing and redundancy can be applied to distributed simulation systems, but they are not efficient. We present a novel fault-tolerance protocol that combines causal message logging with primary-backup replication. The proposed protocol uses an iterative backup location scheme and an adaptive update interval to reduce overhead and to balance the cost of fault tolerance against recovery time. The protocol produces no orphan states and does not require surviving agents to roll back. Most importantly, the recovery scheme can tolerate concurrent failures, including the permanent failure of a single node. The correctness of the protocol is proved, and experiments show that the protocol is efficient.
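The general idea of pairing message logging with a backup copy can be illustrated with a minimal Python sketch. It is not the protocol from the paper. It assumes a deterministic agent whose state is a running sum, a single backup that stores the last checkpoint plus the messages received since then, and a fixed update interval (the abstract describes an adaptive one). The names `Backup`, `Agent`, `deliver`, `recover`, and `update_interval` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Backup:
    """Holds the last checkpointed state plus the messages logged since then."""
    checkpoint: int = 0
    log: List[int] = field(default_factory=list)

    def record(self, msg: int) -> None:
        self.log.append(msg)

    def update(self, state: int) -> None:
        # A fresh checkpoint makes the older log entries unnecessary.
        self.checkpoint = state
        self.log.clear()


@dataclass
class Agent:
    """Deterministic agent: its state is the running sum of delivered messages."""
    backup: Backup
    update_interval: int = 3  # assumption: fixed here, adaptive in the paper
    state: int = 0
    since_update: int = 0

    def deliver(self, msg: int) -> None:
        self.backup.record(msg)  # log the message before applying it
        self.state += msg
        self.since_update += 1
        if self.since_update >= self.update_interval:
            self.backup.update(self.state)
            self.since_update = 0


def recover(backup: Backup, update_interval: int = 3) -> Agent:
    """Rebuild a failed agent by replaying logged messages on the checkpoint."""
    agent = Agent(
        backup=Backup(checkpoint=backup.checkpoint),
        update_interval=update_interval,
    )
    agent.state = backup.checkpoint
    for msg in list(backup.log):
        agent.deliver(msg)
    return agent
```

In this sketch, only the failed agent replays messages; no other agent is asked to roll back, which mirrors the property claimed in the abstract. A shorter `update_interval` means more frequent backup updates and a shorter replay on recovery, while a longer one reduces update traffic at the cost of recovery time.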