PCI-DMA/CPU Handoff for Increased Effectiveness of Checkpointing Functionalities in CCL

  • Authors:
  • Andrea Santoro;Francesco Quaglia

  • Venue:
  • DS-RT '03 Proceedings of the Seventh IEEE International Symposium on Distributed Simulation and Real-Time Applications
  • Year:
  • 2003

Abstract

Checkpointing and Communication Library (CCL) is a recently developed software library supporting optimistic parallel discrete event simulation on Myrinet clusters. Beyond low-latency message delivery, CCL also offers non-blocking checkpointing supported by a programmable PCI DMA engine on board Myrinet cards. CCL employs a re-synchronization functionality between PCI DMA activities and CPU activities to maintain the consistency of checkpointed information (i.e., to prevent the CPU from updating information that still needs to be copied via DMA). If re-synchronization is invoked before the checkpoint operation is completed, simulation activities carried out by the CPU may be forced to wait for checkpoint completion. Since data copy through the PCI DMA is slower than what is achievable with the CPU, in pathological situations a re-synchronization period may last longer than a whole checkpoint operation performed by the CPU, thus nullifying the potential benefit of offloading checkpointing from the CPU. This paper tackles this issue by presenting the design and implementation of a handoff mechanism for checkpoint operations between the PCI DMA and the CPU, which enhances the effectiveness of the checkpointing functionalities offered by CCL. Although a checkpoint operation is initially entrusted to the PCI DMA, whenever re-synchronization forces the simulation application to wait for its completion, the checkpoint operation is dynamically switched to the CPU, the fastest available device, since its timely completion has become a performance-critical task for the simulation application.
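
The handoff idea described in the abstract can be illustrated with a small toy model. This is only a sketch under stated assumptions, not CCL's actual interface: the class `CheckpointOp` and the methods `dma_step` and `resync` are hypothetical names, and the DMA engine is simulated by an ordinary function that copies a bounded chunk per call. The model shows a checkpoint that starts on the (simulated) DMA engine and, when re-synchronization is requested before it completes, is finished by the CPU instead of making the CPU wait.

```python
class CheckpointOp:
    """Toy model of one checkpoint copy that starts on a simulated DMA engine.

    All names here are hypothetical; this is not the CCL API.
    """

    def __init__(self, state: bytearray):
        self.state = state                     # live simulation state (source)
        self.snapshot = bytearray(len(state))  # checkpoint buffer (destination)
        self.copied = 0                        # bytes already in the snapshot
        self.cpu_bytes = 0                     # bytes the CPU copied after a handoff

    def done(self) -> bool:
        return self.copied == len(self.state)

    def dma_step(self, chunk: int) -> None:
        """Simulate the DMA engine copying up to `chunk` more bytes."""
        end = min(self.copied + chunk, len(self.state))
        self.snapshot[self.copied:end] = self.state[self.copied:end]
        self.copied = end

    def resync(self) -> None:
        """Must be called before the CPU is allowed to modify `state` again.

        Without handoff, the CPU would wait here for the slower DMA copy.
        With handoff, the CPU copies the remaining bytes itself. A real
        implementation would also have to stop the DMA engine first; that
        step is assumed and not modelled here.
        """
        if not self.done():
            remaining = len(self.state) - self.copied
            self.snapshot[self.copied:] = self.state[self.copied:]
            self.cpu_bytes = remaining
            self.copied = len(self.state)


def demo() -> CheckpointOp:
    state = bytearray(range(16))
    op = CheckpointOp(state)
    op.dma_step(6)   # DMA has copied 6 of 16 bytes so far
    op.resync()      # simulation needs the state now: CPU takes over the rest
    state[0] = 255   # safe to update: the snapshot is already complete
    return op
```

In this sketch, the snapshot stays consistent because the state is modified only after `resync` returns, which mirrors the role of re-synchronization described in the abstract.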