FREM: A Fast Restart Mechanism for General Checkpoint/Restart

  • Authors: Yawei Li; Zhiling Lan

  • Affiliations: Google, Inc.; Illinois Institute of Technology, Chicago

  • Venue: IEEE Transactions on Computers
  • Year: 2011

Abstract

As failure rates keep increasing in large systems, the applications running on them restart more frequently than ever. Existing research on checkpoint/restart focuses mainly on optimizing the checkpoint operation and pays little attention to the restart operation. As a result, application restart latency may be substantial, which greatly threatens system dependability and performance. To attack the restart latency problem, we present FREM, a fast restart mechanism for general checkpoint/restart protocols. By dynamically tracking the data accesses of a process after each checkpoint, FREM masks restart latency by overlapping application recovery with the retrieval of its checkpoint image. We have implemented FREM as a prototype system and tested it on Linux. Extensive experiments with real applications demonstrate that it reduces restart latency by over 50 percent on average compared with conventional restart mechanisms.
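
The overlap described in the abstract can be illustrated with a small sketch. It is not FREM's implementation: the page-level dictionary model and the names `CheckpointImage`, `RestartedProcess`, and `touch_set` are assumptions made for illustration only. The idea shown is that pages recorded as accessed shortly after a checkpoint are loaded first, so execution can resume while the rest of the image is still being retrieved.

```python
import threading


class CheckpointImage:
    """Hypothetical stand-in for a checkpoint image on slow storage."""

    def __init__(self, pages):
        self.pages = dict(pages)  # page id -> saved contents

    def fetch(self, page_id):
        # A real system would read from disk or the network here.
        return self.pages[page_id]


class RestartedProcess:
    """Models a process whose memory is restored in two phases (assumed design)."""

    def __init__(self, image, touch_set):
        self.image = image
        self.memory = {}
        self.lock = threading.Lock()
        # Phase 1: load only the pages recorded as accessed soon after
        # the checkpoint, so execution can resume early.
        for page_id in touch_set:
            self.memory[page_id] = image.fetch(page_id)
        # Phase 2: retrieve the remaining pages in the background.
        self.loader = threading.Thread(target=self._load_rest)
        self.loader.start()

    def _load_rest(self):
        for page_id in self.image.pages:
            self.read(page_id)

    def read(self, page_id):
        # Fall back to an on-demand fetch if a page has not arrived yet.
        with self.lock:
            if page_id not in self.memory:
                self.memory[page_id] = self.image.fetch(page_id)
            return self.memory[page_id]


image = CheckpointImage({0: "stack", 1: "heap-a", 2: "heap-b", 3: "globals"})
proc = RestartedProcess(image, touch_set=[3, 0])
first = proc.read(3)   # already resident after phase 1
late = proc.read(2)    # served whether or not the loader has reached it
proc.loader.join()     # wait until the whole image is resident
```

In this toy model the application can call `read` as soon as the constructor returns, which is the point at which a conventional restart would still be waiting for the full image.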