A large-scale HLA-based simulation (federation) is composed of a large number of simulation components (federates), which may be developed by different participants and executed at different locations. These federates are subject to failures for various reasons. Moreover, the risk of federation failure increases with the number of federates in the federation. In this paper, a fault tolerance mechanism is proposed to tolerate crash-stop failures of federates. By exploiting the decoupled federate architecture, federate failures can be masked from the federation, and recovery can take place without interrupting the execution of other federates. A basic state recovery protocol is first proposed, which recovers the state of a failed federate from the checkpoint and message log taken before the failure. An optimized protocol is then developed to accelerate state recovery. Experiments verify that the proposed mechanism provides correct failure recovery. The experimental results also indicate that the optimized protocol can considerably outperform the basic one.
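The general idea of recovering state from a checkpoint plus a message log can be illustrated with a minimal sketch. This is not the protocol proposed in the paper, whose details are not given in the abstract. It assumes that a federate processes messages deterministically, so replaying the messages logged since the last checkpoint reproduces the pre-failure state. The names `Federate`, `RecoveryLog`, and `run_demo` are hypothetical and exist only for this illustration.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Federate:
    """Hypothetical federate: its state is a running total and a logical time."""
    total: int = 0
    time: int = 0

    def handle(self, msg):
        # Assumption: message handling is deterministic.
        timestamp, value = msg
        self.total += value
        self.time = timestamp


@dataclass
class RecoveryLog:
    """Keeps the latest checkpoint and the messages received after it."""
    checkpoint: Federate = field(default_factory=Federate)
    messages: list = field(default_factory=list)

    def take_checkpoint(self, federate):
        # Snapshot the state; earlier messages are no longer needed.
        self.checkpoint = copy.deepcopy(federate)
        self.messages.clear()

    def record(self, msg):
        self.messages.append(msg)

    def recover(self):
        # Restore the snapshot, then replay the logged messages in order.
        restored = copy.deepcopy(self.checkpoint)
        for msg in self.messages:
            restored.handle(msg)
        return restored


def run_demo():
    federate = Federate()
    log = RecoveryLog()
    incoming = [(1, 10), (2, 20), (3, 30), (4, 40)]
    for index, msg in enumerate(incoming):
        log.record(msg)
        federate.handle(msg)
        if index == 1:
            log.take_checkpoint(federate)  # checkpoint after the second message
    state_before_crash = (federate.total, federate.time)
    recovered = log.recover()  # simulate a crash and rebuild a replacement
    return state_before_crash, (recovered.total, recovered.time)


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the replacement federate is rebuilt entirely from the log, so the rest of the system would not need to roll back. A real mechanism would also have to handle outgoing messages, time management, and failure detection, which are outside the scope of this example.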