Integrating coordinated checkpointing and recovery mechanisms into DSM synchronization barriers

  • Authors:
  • Azzedine Boukerche; Alba Cristina Magalhaes Alves De Melo

  • Affiliations:
  • University of Ottawa, Canada; University of Brasilia, Brasilia, Brazil

  • Venue:
  • Journal of Experimental Algorithmics (JEA)
  • Year:
  • 2007

Abstract

Distributed shared memory (DSM) gives parallel programmers the abstraction of a physically shared memory. Most recent software DSM systems provide relaxed memory models that guarantee consistency only at synchronization operations, such as locks and barriers. Because the main goal of DSM systems is to support long-running, computation-intensive applications, checkpointing and recovery mechanisms are highly desirable. This article presents and evaluates the integration of a coordinated checkpointing mechanism into the barrier primitive that many DSM systems provide. Our results on some popular benchmarks and a real parallel application show that the overhead introduced during failure-free execution is often small.
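
The abstract does not describe the protocol itself, so the following is only a minimal sketch of the general idea of taking a coordinated checkpoint at a barrier, not the authors' mechanism. It assumes threads standing in for DSM nodes, a plain dictionary standing in for the shared pages, and an in-memory snapshot standing in for stable storage. The names `CheckpointingBarrier`, `interval`, `recover`, and `run_demo` are hypothetical.

```python
import copy
import threading


class CheckpointingBarrier:
    """Barrier that snapshots shared state every `interval` crossings (illustrative only)."""

    def __init__(self, parties, shared, interval=1):
        self.shared = shared        # stands in for the DSM shared pages (assumption)
        self.interval = interval    # checkpoint every `interval` barrier crossings
        self.crossings = 0
        self.checkpoint = None      # (crossing number, snapshot of shared state)
        # The barrier action runs in exactly one thread after all parties have
        # arrived and before any is released, so no one writes during the snapshot.
        self._barrier = threading.Barrier(parties, action=self._on_all_arrived)

    def _on_all_arrived(self):
        self.crossings += 1
        if self.crossings % self.interval == 0:
            self.checkpoint = (self.crossings, copy.deepcopy(self.shared))

    def wait(self):
        self._barrier.wait()

    def recover(self):
        """Roll shared state back to the last checkpoint and return its crossing number."""
        if self.checkpoint is None:
            raise RuntimeError("no checkpoint taken yet")
        crossing, snapshot = self.checkpoint
        self.shared.clear()
        self.shared.update(copy.deepcopy(snapshot))
        return crossing


def run_demo(workers=4, steps=3):
    shared = {rank: 0 for rank in range(workers)}
    barrier = CheckpointingBarrier(workers, shared, interval=1)

    def worker(rank):
        for _ in range(steps):
            shared[rank] += 1       # each worker writes only its own slot
            barrier.wait()          # state is consistent here, so checkpoint here

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    shared[0] = -999                # simulate state lost or corrupted by a failure
    resumed_at = barrier.recover()  # every worker would restart from this barrier
    return resumed_at, dict(shared)
```

Piggybacking the checkpoint on the barrier means no extra coordination messages are needed to reach a consistent global state, since the barrier already guarantees that every participant has stopped at the same point. A real DSM system would additionally have to save each node's private state and write the snapshot to storage that survives a crash.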