Semi-asynchronous checkpointing for optimistic simulation on a Myrinet based NOW

Authors:
Francesco Quaglia;Andrea Santoro
Affiliations:
Dipartimento di Informatica e Sistemistica, Università di Roma "La Sapienza", Via Salaria 113, 00198 Roma, Italy;Dipartimento di Informatica e Sistemistica, Università di Roma "La Sapienza", Via Salaria 113, 00198 Roma, Italy
Venue:
Proceedings of the fifteenth workshop on Parallel and distributed simulation
Year:
2001


Abstract

Great effort has been devoted to the design of optimized checkpointing strategies for optimistic parallel discrete event simulators. Less work has addressed the execution mode of each individual checkpoint operation. Checkpoint operations are typically charged to the CPU, which freezes the simulation application while checkpointing is in progress; that is, the checkpointing protocol typically executes synchronously. In this paper we focus on improving the execution mode and present a software architecture, designed for Myrinet-based Networks of Workstations (NOWs), that avoids freezing the application during checkpoint operations, thus moving execution towards an asynchronous mode. It does so by charging checkpoint operations to a hardware component distinct from the CPU, namely a DMA engine. Totally asynchronous checkpointing, however, could suffer from data inconsistency whenever a state buffer is accessed for modification while a checkpoint operation involving it has not yet completed. To avoid this, the architecture includes functionality for resynchronization on demand. We have used this functionality to implement an execution mode of the checkpointing protocol that we refer to as semi-asynchronous. Based on the results of an experimental study, we argue that the semi-asynchronous mode can be an effective way to remove almost all of the delay associated with checkpoint operations from the completion time of the simulation.
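
The idea of launching a checkpoint and resynchronizing only on demand can be illustrated with a minimal Python sketch. This is not the authors' architecture: a worker thread stands in for the DMA engine, and every name here (`SemiAsyncCheckpointer`, `start_checkpoint`, `resync`, `run`) is hypothetical, chosen only for illustration. The sketch assumes at most one checkpoint in flight and a state held in a single byte buffer.

```python
import threading


class SemiAsyncCheckpointer:
    """Toy model: a worker thread plays the role of the DMA engine (assumption)."""

    def __init__(self):
        self._pending = None   # thread performing the in-flight copy, if any
        self.checkpoints = []  # saved (timestamp, snapshot) pairs

    def start_checkpoint(self, timestamp, state):
        """Launch the copy and return without waiting for it to finish."""
        self.resync()  # this sketch allows at most one copy in flight
        worker = threading.Thread(target=self._copy, args=(timestamp, state))
        self._pending = worker
        worker.start()

    def _copy(self, timestamp, state):
        # Runs off the main thread, standing in for a transfer off the CPU.
        self.checkpoints.append((timestamp, bytes(state)))

    def resync(self):
        """Resynchronization on demand: block until the pending copy is done."""
        if self._pending is not None:
            self._pending.join()
            self._pending = None


def run(events):
    """Apply (index, value) events to a 4-byte state, checkpointing before each."""
    state = bytearray(4)
    ckpt = SemiAsyncCheckpointer()
    for timestamp, (index, value) in enumerate(events):
        ckpt.start_checkpoint(timestamp, state)  # snapshot of the pre-event state
        # Work that does not modify the state buffer could overlap with the copy here.
        ckpt.resync()  # the copy must complete before the buffer is modified
        state[index] = value
    ckpt.resync()
    return state, ckpt.checkpoints
```

Calling `run([(0, 7), (1, 9)])` returns the final state together with one snapshot per event, each taken before the event modified the buffer. The `resync()` call placed just before the write is what distinguishes this from a totally asynchronous scheme, where the write could race with the copy.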