Interconnect agnostic checkpoint/restart in Open MPI
Proceedings of the 18th ACM international symposium on High performance distributed computing
We present a new fault tolerance system, DejaVu, for transparent and automatic checkpointing, migration, and recovery of parallel and distributed applications. DejaVu has several novel features. First, it provides a transparent parallel checkpointing and recovery mechanism that recovers from any combination of system failures without modification to parallel applications or the underlying operating system. Second, it uses a novel instrumentation and state capture mechanism that transparently captures application state. Third, it uses a new runtime mechanism for transparent incremental checkpointing, capturing the least amount of state needed to maintain global consistency. Finally, it provides a novel communication architecture that enables transparent migration of existing MPI codes, without source-code modifications. DejaVu has been implemented for 32-bit and 64-bit Linux platforms on x86 processors interconnected over InfiniBand or Gigabit Ethernet networks. Performance results from the production-ready implementation show less than 5% overhead for real-world parallel applications with large memory footprints.
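The abstract does not say how DejaVu implements incremental checkpointing, so the following is only a generic sketch of the idea: save just the pages that changed since the previous checkpoint. Everything in it is an assumption made for illustration: the 4096-byte page size, the hash-based change detection, a memory image of fixed length, and the hypothetical names `IncrementalCheckpointer` and `restore`. It is not DejaVu's mechanism.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size, chosen only for illustration


def split_pages(memory: bytes, page_size: int = PAGE_SIZE) -> list[bytes]:
    """Cut a flat memory image into fixed-size pages."""
    return [memory[i:i + page_size] for i in range(0, len(memory), page_size)]


class IncrementalCheckpointer:
    """Hypothetical helper: records only pages whose content changed."""

    def __init__(self) -> None:
        # Page index -> digest of that page at the last checkpoint.
        self._digests: dict[int, bytes] = {}

    def checkpoint(self, memory: bytes) -> dict[int, bytes]:
        """Return {page index: page bytes} for pages modified since last call."""
        delta: dict[int, bytes] = {}
        for index, page in enumerate(split_pages(memory)):
            digest = hashlib.sha256(page).digest()
            if self._digests.get(index) != digest:
                delta[index] = page
                self._digests[index] = digest
        return delta


def restore(deltas: list[dict[int, bytes]]) -> bytes:
    """Rebuild the memory image by replaying deltas from oldest to newest."""
    pages: dict[int, bytes] = {}
    for delta in deltas:
        pages.update(delta)  # later deltas overwrite earlier page contents
    return b"".join(pages[i] for i in sorted(pages))
```

The first call to `checkpoint` returns every page, and later calls return only the modified ones, so the stored state grows with the amount of change instead of the total memory footprint. A real system would more likely detect writes through page protection or kernel dirty-page tracking than by hashing every page.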