Tuning of the Checkpointing and Communication Library for Optimistic Simulation on Myrinet Based NOWs

  • Authors:
  • Francesco Quaglia;Andrea Santoro;Bruno Ciciani

  • Venue:
  • MASCOTS '01 Proceedings of the Ninth International Symposium in Modeling, Analysis and Simulation of Computer and Telecommunication Systems
  • Year:
  • 2001

Abstract

A Checkpointing and Communication Library (CCL) for optimistic simulation on Myrinet-based Networks of Workstations (NOWs) has recently been presented. CCL offloads checkpoint operations from the CPU by charging them to a programmable DMA engine on the Myrinet network card. CCL also includes functionality for freezing the simulation application on demand, which can be used to maintain data consistency (for example, when a state buffer must be modified while a DMA-based checkpoint operation involving it is still in progress). Programming the DMA engine to perform a checkpoint operation by transferring large data blocks in a single burst keeps the latency of each checkpoint operation low, which in turn reduces the probability that the application actually has to be frozen. On the other hand, transferring large data blocks in a single burst might interfere with communication, since that DMA engine (and other circuitry) cannot be used for communication until the data transfer in progress has completed. In this paper we present a detailed analysis of the effects of the burst length, from which we outline a set of relevant phenomena to take into account when determining a suitable compile-time value for the burst length. We also report measurements quantifying these phenomena for the case of a PC cluster. The data indicate that communication does not suffer from the use of non-minimal burst lengths for checkpoint operations, showing that, if well tuned, CCL provides highly effective, CPU-offloaded checkpointing.
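
The burst-length trade-off described in the abstract can be illustrated with a toy cost model. The sketch below is not taken from the paper or from CCL: the function name, the parameters (a fixed per-burst setup cost and a constant DMA bandwidth), and the numbers in the usage note are all hypothetical assumptions. It also assumes that no communication is interleaved between bursts.

```python
import math


def checkpoint_model(state_bytes, burst_bytes, setup_us, bytes_per_us):
    """Toy cost model for a DMA checkpoint split into fixed-size bursts.

    Returns (checkpoint_latency_us, max_comm_stall_us).
    All parameters are illustrative assumptions, not values from the paper.
    """
    if state_bytes <= 0 or burst_bytes <= 0:
        raise ValueError("sizes must be positive")

    # Number of bursts needed to copy the whole state buffer.
    bursts = math.ceil(state_bytes / burst_bytes)

    # Each burst pays a fixed setup cost; the payload moves at DMA bandwidth.
    latency_us = bursts * setup_us + state_bytes / bytes_per_us

    # A pending communication request waits for at most one full burst.
    longest_burst = min(burst_bytes, state_bytes)
    stall_us = setup_us + longest_burst / bytes_per_us

    return latency_us, stall_us
```

Under these assumed values, a 4096-byte state buffer with a 2 µs setup cost and 128 bytes/µs of bandwidth gives a latency of 40 µs and a worst-case stall of 10 µs with 1024-byte bursts, against 34 µs for both with a single 4096-byte burst. Longer bursts shorten the checkpoint but lengthen the window in which the DMA engine is unavailable for communication, which is the tension the paper sets out to measure.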