Multiprogrammed non-blocking checkpoints in support of optimistic simulation on myrinet clusters

  • Authors:
  • Andrea Santoro; Francesco Quaglia

  • Affiliations:
  • Dipartimento di Informatica e Sistemistica, Università di Roma "La Sapienza", Via Salaria 113, 00198 Roma, Italy; Dipartimento di Informatica e Sistemistica, Università di Roma "La Sapienza", Via Salaria 113, 00198 Roma, Italy

  • Venue:
  • Journal of Systems Architecture: the EUROMICRO Journal
  • Year:
  • 2007

Abstract

CCL (Checkpointing and Communication Library) is a software layer supporting optimistic parallel discrete event simulation (PDES) on Myrinet-based COTS clusters. Beyond classical low-latency message delivery, the library implements CPU-offloaded, non-blocking (asynchronous) checkpointing based on the data transfer capabilities of a programmable DMA engine on board Myrinet network cards. This functionality is unique, since optimistic simulation systems conventionally rely on checkpointing implemented as a synchronous, CPU-based data copy. Releases of CCL up to v2.4 support only monoprogrammed non-blocking checkpoints. This forces re-synchronization between CPU and DMA activities, a potential source of overhead, each time a new checkpoint request must be issued at the simulation application level while the last issued one is still being carried out by the DMA engine. In this paper we present a redesigned release of CCL (v3.0) that, exploiting the hardware capabilities of more advanced Myrinet clusters, supports multiprogrammed non-blocking checkpoints. The multiprogrammed approach allows a higher degree of concurrency between checkpointing and other simulation-specific operations carried out by the CPU, with benefits for performance. We also report the results of an experimental evaluation of those benefits for a Personal Communication System (PCS) simulation application, selected as a real-world test bed.
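
The difference between monoprogrammed and multiprogrammed checkpoints can be illustrated with a small timing model. The sketch below is not CCL code and does not use the CCL API: the function `cpu_stall_time`, its parameters, and the single FIFO DMA engine with a fixed per-checkpoint transfer time are all assumptions made for illustration. It counts how long the CPU must wait when a new checkpoint request finds the maximum number of outstanding requests already in flight; a limit of 1 corresponds to the monoprogrammed case.

```python
from collections import deque


def cpu_stall_time(request_times, dma_duration, max_outstanding):
    """Return the total time the CPU waits on the (hypothetical) DMA engine.

    request_times   -- amounts of CPU work completed when each checkpoint
                       is requested, in increasing order (stalls excluded)
    dma_duration    -- assumed fixed time the DMA engine needs per checkpoint
    max_outstanding -- checkpoints that may be in flight at once;
                       1 models the monoprogrammed case
    """
    in_flight = deque()   # completion times of outstanding checkpoints (FIFO)
    dma_free_at = 0.0     # time at which the DMA engine becomes idle
    stall = 0.0           # accumulated CPU waiting time

    for work_done in request_times:
        now = work_done + stall

        # Retire checkpoints the DMA engine has already completed.
        while in_flight and in_flight[0] <= now:
            in_flight.popleft()

        # No free slot: the CPU re-synchronizes with the oldest transfer.
        if len(in_flight) == max_outstanding:
            finished = in_flight.popleft()
            stall += finished - now
            now = finished

        # The engine serves requests one at a time, in arrival order.
        start = max(now, dma_free_at)
        dma_free_at = start + dma_duration
        in_flight.append(dma_free_at)

    return stall


if __name__ == "__main__":
    requests = [0, 1, 2, 10]  # a burst of three checkpoints, then a late one
    for depth in (1, 2, 4):
        print(depth, cpu_stall_time(requests, 3, depth))
```

In this toy model a burst of closely spaced checkpoint requests stalls the CPU when only one request may be outstanding, while a deeper queue lets the CPU keep processing events as the DMA engine drains the backlog. The model ignores memory-bus contention and any cost of issuing a request, so it says nothing about the actual gains reported in the paper.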