An overview of the BlueGene/L Supercomputer
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Adaptive incremental checkpointing for massively parallel systems
Proceedings of the 18th annual international conference on Supercomputing
Efficient application migration under compiler guidance
LCTES '05 Proceedings of the 2005 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
A large-scale study of failures in high-performance computing systems
DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Hi-index | 0.00 |
As modern supercomputing systems reach the peta-flop performance range, they grow in both size and complexity. This makes them increasingly vulnerable to failures from a variety of causes. Checkpointing is a popular technique for tolerating such failures in that it allows applications to periodically save their state and restart the computation after a failure. Although a variety of automated system-level checkpointing solutions are currently available to HPC users, manual application-level checkpointing remains by far the most popular approach because of its superior performance. This paper focuses on improving the performance of automated checkpointing via a compiler analysis for incremental checkpointing. This analysis is shown to significantly reduce checkpoint sizes (upto 78%) and to enable asynchronous checkpointing.