On the viability of checkpoint compression for extreme scale fault tolerance

  • Authors:
  • Dewan Ibtesham; Dorian Arnold; Kurt B. Ferreira; Patrick G. Bridges

  • Affiliations:
  • University of New Mexico, Albuquerque, NM; University of New Mexico, Albuquerque, NM; Sandia National Laboratories, Albuquerque, NM; University of New Mexico, Albuquerque, NM

  • Venue:
  • Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
  • Year:
  • 2011

Abstract

The increasing size and complexity of high-performance computing systems have led to major concerns over fault frequencies and the mechanisms necessary to tolerate these faults. Previous studies have shown that state-of-the-field checkpoint/restart mechanisms will not scale sufficiently for future-generation systems. In this work, we explore the feasibility of checkpoint data compression to reduce checkpoint commit latency and storage overheads. Leveraging a simple model for checkpoint compression viability, we conclude that checkpoint data compression should be considered as part of a scalable checkpoint/restart solution, and we discuss additional scenarios and improvements that may make checkpoint data compression even more viable.