Application monitoring and checkpointing in HPC: looking towards exascale systems

Authors:
William M. Jones;John T. Daly;Nathan DeBardeleben
Affiliations:
Coastal Carolina University, Conway, SC;Center for Exceptional Computing, ACS, Fort Meade, MD;High Performance Computing, Los Alamos National Laboratory, Los Alamos, NM
Venue:
Proceedings of the 50th Annual Southeast Regional Conference
Year:
2012

Abstract

As computational clusters rapidly grow in both size and complexity, system reliability and, in particular, application resilience have become increasingly important to maintaining efficiency and to improving compute performance over predecessor systems. One commonly used mechanism for providing application fault tolerance in parallel systems is checkpointing. We demonstrate the impact of sub-optimal checkpoint intervals on application efficiency via simulation with real workload data. We find that application efficiency is relatively insensitive to error in the estimate of an application's mean time to interrupt (AMTTI), a parameter central to calculating the optimal checkpoint interval. This result corroborates the trends predicted by previous analytical models. We also find that erring on the side of overestimation may be preferable to underestimation. We further discuss how application monitoring and resilience frameworks can benefit from this insensitivity to error in AMTTI estimates. Finally, we discuss the importance of application monitoring at exascale and conclude with a discussion of challenges faced in the use of checkpointing at such extreme scales.
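The insensitivity described in the abstract can be illustrated with a small analytical sketch. This is not the authors' simulation: it assumes Young's first-order interval, tau = sqrt(2 * delta * M), and Daly's expected wall-time model, and every parameter value (checkpoint cost, restart cost, AMTTI) is hypothetical. It does not attempt to reproduce the paper's finding about overestimation versus underestimation, which comes from simulation with real workload data.

```python
import math

def young_interval(delta, mtti):
    """Young's first-order estimate of the optimum compute time between checkpoints."""
    return math.sqrt(2.0 * delta * mtti)

def efficiency(tau, delta, restart, mtti):
    """Useful work divided by expected wall time under Daly's model
    (assumed here): T_wall = M * exp(R/M) * (exp((tau + delta)/M) - 1) * T_solve / tau."""
    return tau / (mtti * math.exp(restart / mtti) * math.expm1((tau + delta) / mtti))

def efficiency_with_estimate(error_factor, delta, restart, true_mtti):
    """Efficiency when the interval is chosen from a mis-estimated AMTTI
    (estimate = error_factor * true value) but failures follow the true AMTTI."""
    tau = young_interval(delta, error_factor * true_mtti)
    return efficiency(tau, delta, restart, true_mtti)

# Hypothetical parameters, all in minutes.
DELTA = 5.0        # time to write one checkpoint
RESTART = 10.0     # time to restart after an interrupt
AMTTI = 1440.0     # true application mean time to interrupt (24 hours)

if __name__ == "__main__":
    for factor in (0.25, 0.5, 1.0, 2.0, 4.0):
        eff = efficiency_with_estimate(factor, DELTA, RESTART, AMTTI)
        print(f"AMTTI estimate = {factor:4.2f} x true value -> efficiency {eff:.4f}")
```

Under these assumed values, an AMTTI estimate that is off by a factor of two in either direction changes the modeled efficiency by less than one percentage point, which is consistent with the trend the abstract reports.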