Dynamic fault tolerance in distributed simulation system

  • Authors:
  • Min Ma; Shiyao Jin; Chaoqun Ye; Xiaojian Liu

  • Affiliations:
  • School of Computer Science, National University of Defense Technology, Changsha, Hunan, China (all authors)

  • Venue:
  • ICCS'06 Proceedings of the 6th international conference on Computational Science - Volume Part I
  • Year:
  • 2006

Abstract

Distributed simulation systems are widely used for forecasting, decision-making, and scientific computing, and multi-agent systems and Grids have been used as simulation platforms. To survive software or hardware failures and to guarantee the success rate of agent migration, such a system must address fault tolerance. Classic fault tolerance techniques such as checkpointing and redundancy can be applied to distributed simulation systems, but they are not efficient. We present a novel fault tolerance protocol that combines causal message logging with the primary-backup technique. The proposed protocol uses an iterative backup location scheme and an adaptive update interval to reduce overhead and to balance the cost of fault tolerance against recovery time. The protocol creates no orphan states and does not require surviving agents to roll back. Most importantly, the recovery scheme can tolerate concurrent failures, and even the permanent failure of a single node. The correctness of the protocol is proved, and experiments show that the protocol is efficient.
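
The abstract does not give the protocol's details, so the following is only a minimal sketch of the general idea of pairing message logging with a backup copy; it is not the authors' protocol. All names (`Backup`, `Agent`, `recover`, `update_interval`) are hypothetical. As a simplification, each delivery is logged directly at the backup rather than piggybacked on outgoing messages as true causal logging would do, and the update interval is fixed rather than adaptive.

```python
from dataclasses import dataclass, field


@dataclass
class Backup:
    """Hypothetical backup: a checkpoint plus the messages logged since it."""
    checkpoint: int = 0
    log: list = field(default_factory=list)  # (sender, seq, payload) in delivery order

    def record(self, determinant):
        self.log.append(determinant)

    def update(self, state):
        # A fresh checkpoint makes the older log entries unnecessary.
        self.checkpoint = state
        self.log.clear()


@dataclass
class Agent:
    backup: Backup
    state: int = 0
    delivered: int = 0
    update_interval: int = 3  # assumption: fixed here, adaptive in the paper

    def deliver(self, sender, seq, payload):
        # Log the delivery before applying it, so the backup can replay it.
        self.backup.record((sender, seq, payload))
        self.state += payload
        self.delivered += 1
        if self.delivered % self.update_interval == 0:
            self.backup.update(self.state)


def recover(backup):
    """Rebuild a failed agent by replaying the log on top of the checkpoint."""
    agent = Agent(backup=Backup(backup.checkpoint, list(backup.log)),
                  state=backup.checkpoint)
    for _sender, _seq, payload in backup.log:
        agent.state += payload
    return agent
```

In this sketch only the failed agent is rebuilt, from its own checkpoint and log, which mirrors the abstract's claim that surviving agents need not roll back. A shorter `update_interval` lowers recovery time at the cost of more frequent backup updates, which is the trade-off an adaptive interval would balance.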