Checkpointing Distributed Shared Memory

  • Authors:
  • Luis M. Silva; João Gabriel Silva

  • Affiliations:
  • Departamento Engenharia Informatica, Universidade de Coimbra, POLO II-Vila Franca, P-3030-Coimbra, Portugal, Email: luis@dei.uc.pt; Departamento Engenharia Informatica, Universidade de Coimbra, POLO II-Vila Franca, P-3030-Coimbra, Portugal, Email: jgabriel@dei.uc.pt

  • Venue:
  • The Journal of Supercomputing - Special issue: high performance distributed computing
  • Year:
  • 1997

Abstract

Distributed shared memory (DSM) is a very promising programming model for exploiting the parallelism of distributed memory systems, because it provides a higher level of abstraction than simple message passing. Although the nodes of standard distributed systems exhibit high crash rates, only very few DSM environments have some kind of support for fault-tolerance. In this article, we present a checkpointing mechanism for a DSM system that is efficient and portable. It offers some portability because it is built on top of MPI and uses only the services offered by MPI and a POSIX-compliant local file system. As far as we know, this is the first real implementation of such a scheme for DSM. Along with the description of the algorithm we present experimental results obtained in a cluster of workstations. We hope that our research shows that efficient, transparent and portable checkpointing is viable for DSM systems.