Providing fault tolerance in high-end petascale systems, which consist of millions of hardware components and complex software stacks, is becoming increasingly challenging. Checkpointing remains the most prevalent technique for providing fault tolerance in such systems. Considerable research has focused on optimizing checkpointing; in practice, however, it still imposes a high overhead on users. In this paper, we study the checkpointing overhead seen by various applications running on leadership-class machines such as the IBM Blue Gene/P at Argonne National Laboratory. In addition to studying popular applications, we design a methodology to help users understand and intelligently choose an optimal checkpointing frequency that reduces the overall checkpointing overhead incurred. In particular, we study four applications: the Grid-Based Projector-Augmented Wave application, the Car-Parrinello Molecular Dynamics application, the Nek5000 computational fluid dynamics application, and the Parallel Ocean Program application. We analyze their memory usage and possible checkpointing trends on 65,536 processors of the Blue Gene/P system.
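To make the trade-off behind choosing a checkpointing frequency concrete, the sketch below uses Young's well-known first-order approximation, in which the compute time between checkpoints is sqrt(2 * C * M) for a checkpoint write cost C and a mean time between failures M. This is not the methodology designed in the paper, which the abstract does not spell out. The function names and the numeric inputs are hypothetical and chosen only for illustration; they are not measurements from Blue Gene/P. The approximation is reasonable only when C is much smaller than M.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order estimate of the compute time between checkpoints.

    checkpoint_cost_s: time to write one checkpoint, in seconds (assumed input).
    mtbf_s: mean time between failures of the whole system, in seconds (assumed input).
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_overhead_fraction(checkpoint_cost_s: float, interval_s: float) -> float:
    """Fraction of failure-free wall-clock time spent writing checkpoints."""
    if checkpoint_cost_s < 0 or interval_s <= 0:
        raise ValueError("cost must be non-negative and interval must be positive")
    return checkpoint_cost_s / (interval_s + checkpoint_cost_s)


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    cost = 50.0       # seconds to write one checkpoint
    mtbf = 10_000.0   # seconds between failures, system-wide
    interval = young_interval(cost, mtbf)
    overhead = checkpoint_overhead_fraction(cost, interval)
    print(f"checkpoint every {interval:.0f} s of compute")
    print(f"failure-free checkpoint overhead: {overhead:.1%}")
```

Checkpointing more often than this estimate spends more time writing checkpoints, while checkpointing less often loses more work on each failure. Applications with larger memory footprints have a larger checkpoint cost, which pushes the estimated interval up.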