Linux Support for Fast Transparent General Purpose Checkpoint/Restart of Multithreaded Processes in Loadable Kernel Module

  • Authors:
  • Amirreza Zarrabi;Khairulmizam Samsudin;Wan Azizun Wan Adnan

  • Affiliations:
  • Department of Computer and Communication Systems Engineering, Faculty of Engineering, University Putra Malaysia, Selangor, Malaysia 43400;Department of Computer and Communication Systems Engineering, Faculty of Engineering, University Putra Malaysia, Selangor, Malaysia 43400;Department of Computer and Communication Systems Engineering, Faculty of Engineering, University Putra Malaysia, Selangor, Malaysia 43400

  • Venue:
  • Journal of Grid Computing
  • Year:
  • 2013

Abstract

Checkpoint/restart is the ability to save the state of a running application so that it can later resume execution from the point of the checkpoint. The technique has many potential applications, including building fault-tolerant environments, improving system resource utilization, and true process migration. As hardware speed and cluster size increase, the average time between failures decreases. Fault tolerance and the ability to checkpoint a process have therefore become indispensable. Almost all platforms deployed for high-performance computing support process checkpoint/restart. Linux, although one of the most popular operating systems, does not provide a general-purpose implementation. Existing implementations are limited to a specific type of parallel programming library, confined to particular well-behaved types of applications, or reliant on specific kernel features that are often missing. Most of them also require recompiling the whole kernel to apply the required patches. In this paper, we describe the design and implementation of a checkpoint/restart system for multithreaded processes on Linux that can be extended dynamically to increase compatibility and reduce system overhead. It does not require any special facility in the operating system, and it can checkpoint and restart an application fully transparently, independent of the application's behavior. The entire system is implemented as multiple loadable kernel modules, which makes it easy to use and eliminates the burden of complex system administration.