A checkpoint-based high availability run-time system for Windows NT clusters

  • Authors:
  • Zhang Youhui;Wang Dongsheng

  • Affiliations:
  • Tsinghua Univ., Beijing, P.R.C, 100084;Tsinghua Univ., Beijing, P.R.C, 100084

  • Venue:
  • ACM SIGOPS Operating Systems Review
  • Year:
  • 2002

Abstract

This paper presents ChaRM-NT, a high-availability run-time system that provides checkpoint-based rollback recovery for parallel applications on clusters of computers (COCs) running Windows NT. ChaRM-NT implements an insert-mode, reduced coordinated checkpointing and rollback recovery (CRR) mechanism. With these techniques, ChaRM-NT can recover parallel applications from their checkpoint files after a system failure. In addition, we have implemented a new coordinated checkpointing algorithm that requires only O(n) control messages, where n is the number of participating processes. ChaRM-NT also implements a portable single-process CRR library that is independent of the message passing environment (MPE). It is therefore easy to adapt to different MPEs, and it currently supports PVM and MPI for NT.
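
The abstract does not describe the algorithm itself, so the following is only an illustrative sketch of how a coordinator-driven checkpoint round can stay within O(n) control messages. It is not ChaRM-NT's algorithm. The names `Process`, `coordinated_checkpoint`, and `rollback` are hypothetical, message delivery is simulated by direct method calls, and in-flight application messages are ignored. Each process receives one request, returns one acknowledgement, and receives one commit, for 3n control messages in total.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Process:
    """Hypothetical stand-in for one process of a parallel application."""
    rank: int
    state: int = 0                    # simulated application state
    tentative: Optional[int] = None   # checkpoint taken but not yet committed
    committed: Optional[int] = None   # last globally consistent checkpoint

    def on_request(self) -> str:
        # Take a tentative checkpoint and acknowledge the coordinator.
        self.tentative = self.state
        return "ack"

    def on_commit(self) -> None:
        # Promote the tentative checkpoint to the committed one.
        self.committed = self.tentative
        self.tentative = None


def coordinated_checkpoint(processes: List[Process]) -> int:
    """Run one checkpoint round and return the number of control messages."""
    messages = 0
    acks = 0
    for p in processes:
        messages += 1                 # coordinator -> process: request
        if p.on_request() == "ack":
            acks += 1
        messages += 1                 # process -> coordinator: ack
    if acks == len(processes):
        for p in processes:
            messages += 1             # coordinator -> process: commit
            p.on_commit()
    return messages


def rollback(processes: List[Process]) -> None:
    """Restore every process to its last committed checkpoint (assumption:
    recovery rolls all processes back together)."""
    for p in processes:
        if p.committed is not None:
            p.state = p.committed


if __name__ == "__main__":
    procs = [Process(rank=i, state=i * 10) for i in range(4)]
    print(coordinated_checkpoint(procs))  # 3 messages per process
```

Because the coordinator exchanges a fixed number of messages with each process and the processes never message one another, the count grows linearly with n rather than quadratically, as it would under an all-to-all exchange.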