Integrating coordinated checkpointing and recovery mechanisms into DSM synchronization barriers

  • Authors:
  • Azzedine Boukerche; Jeferson Koch; Alba Cristina Magalhaes Alves de Melo

  • Affiliations:
  • SITE – School of Information Technology and Engineering, University of Ottawa, Canada; Department of Computer Science, University of Brasilia, Brasilia, DF, Brazil; Department of Computer Science, University of Brasilia, Brasilia, DF, Brazil

  • Venue:
  • WEA'05 Proceedings of the 4th International Conference on Experimental and Efficient Algorithms
  • Year:
  • 2005

Abstract

Distributed Shared Memory (DSM) creates an abstraction of a physical shared memory that parallel programmers can access. Most recent software DSMs provide relaxed memory models that guarantee consistency only at synchronization operations. As the main goal of DSM systems is to support long-running, computation-intensive applications, checkpointing and recovery mechanisms are highly desirable. This article presents and evaluates the integration of a coordinated checkpointing mechanism into the barrier primitive that many DSM systems provide. Our results on some popular benchmarks and a real parallel application show that the overhead introduced during failure-free execution is often small.
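
The general idea of taking a coordinated checkpoint at a barrier can be illustrated with a minimal sketch. It is not the authors' implementation: threads stand in for DSM nodes, a Python list stands in for the shared memory, an in-memory list stands in for stable storage, and the names run_simulation, take_checkpoint and recover are hypothetical. It relies only on the standard threading.Barrier, whose action callback runs in exactly one thread once every participant has arrived and before any is released, which is the point at which a relaxed memory model guarantees a consistent view.

```python
import copy
import threading


def run_simulation(num_workers=4, num_phases=3):
    """Simulate DSM nodes as threads that checkpoint at every barrier."""
    shared = [0] * num_workers   # assumption: stands in for the shared memory
    checkpoints = []             # assumption: stands in for stable storage

    def take_checkpoint():
        # Called by one thread when all workers are blocked at the barrier,
        # so no worker is writing while the snapshot is taken.
        checkpoints.append(copy.deepcopy(shared))

    barrier = threading.Barrier(num_workers, action=take_checkpoint)

    def worker(rank):
        for _ in range(num_phases):
            shared[rank] += rank + 1   # computation phase on this worker's slot
            barrier.wait()             # synchronize; the checkpoint happens here

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared, checkpoints


def recover(checkpoints):
    """Return the state saved at the most recent barrier."""
    return copy.deepcopy(checkpoints[-1])


if __name__ == "__main__":
    final_state, saved = run_simulation()
    print(len(saved), "checkpoints; last one:", recover(saved))
```

Because every worker is stopped at the barrier when the snapshot is taken, no extra coordination messages are needed to obtain a consistent global state in this sketch; a real DSM system would additionally have to write the snapshot to stable storage and restart the nodes from it after a failure.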