DejaVu: transparent user-level checkpointing, migration and recovery for distributed systems

  • Authors:
  • Joseph F. Ruscio; Michael A. Heffner; Srinidhi Varadarajan

  • Venue:
  • Proceedings of the 2006 ACM/IEEE conference on Supercomputing
  • Year:
  • 2006

Abstract

We present a new fault tolerance system, DejaVu, for transparent and automatic checkpointing, migration and recovery of parallel and distributed applications. DejaVu has several novel features. First, it provides a transparent parallel checkpointing and recovery mechanism that recovers from any combination of system failures without modification to parallel applications or the underlying operating system. Second, it uses a novel instrumentation and state capture mechanism that transparently captures application state. Third, it uses a new runtime mechanism for transparent incremental checkpointing, capturing the least amount of state needed to maintain global consistency. Finally, it provides a novel communication architecture that enables transparent migration of existing MPI codes, without source-code modifications. DejaVu has been implemented for 32-bit and 64-bit Linux platforms on x86 processors interconnected over InfiniBand or Gigabit Ethernet networks. Performance results from the production-ready implementation show less than 5% overhead with real-world parallel applications with large memory footprints.
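
The abstract does not describe how DejaVu's incremental checkpointing works internally. As a rough illustration of the general idea only (saving just the state that changed since the previous checkpoint), the following minimal sketch detects changed pages by comparing content hashes. This is a generic technique chosen for illustration, not DejaVu's mechanism; `PAGE_SIZE`, `split_pages` and `IncrementalCheckpointer` are hypothetical names, and a flat `bytes` image stands in for process memory.

```python
import hashlib

PAGE_SIZE = 4096  # hypothetical page size in bytes (assumption, not from the paper)


def split_pages(memory: bytes, page_size: int = PAGE_SIZE) -> list[bytes]:
    """Split a flat memory image into fixed-size pages (last page may be short)."""
    return [memory[i:i + page_size] for i in range(0, len(memory), page_size)]


class IncrementalCheckpointer:
    """Illustrative only: stores pages whose content changed since the last checkpoint."""

    def __init__(self, page_size: int = PAGE_SIZE) -> None:
        self.page_size = page_size
        # Page index -> digest of that page as of the most recent checkpoint.
        self.digests: dict[int, bytes] = {}
        # One (image length, {page index: page bytes}) entry per checkpoint.
        self.deltas: list[tuple[int, dict[int, bytes]]] = []

    def checkpoint(self, memory: bytes) -> int:
        """Record the changed pages and return how many were saved."""
        delta: dict[int, bytes] = {}
        for index, page in enumerate(split_pages(memory, self.page_size)):
            digest = hashlib.sha256(page).digest()
            if self.digests.get(index) != digest:
                delta[index] = page
                self.digests[index] = digest
        self.deltas.append((len(memory), delta))
        return len(delta)

    def restore(self) -> bytes:
        """Rebuild the latest image by replaying deltas from oldest to newest."""
        pages: dict[int, bytes] = {}
        length = 0
        for length, delta in self.deltas:
            pages.update(delta)
        page_count = -(-length // self.page_size)  # ceiling division
        return b"".join(pages[i] for i in range(page_count))
```

In this sketch the first checkpoint saves every page, and later checkpoints save only pages whose bytes differ. A transparent user-level system would more plausibly track writes through memory protection than by rehashing every page, and a parallel checkpoint would also have to coordinate in-flight messages to keep the global state consistent; neither aspect is modeled here.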