Modeling the Impact of Checkpoints on Next-Generation Systems

Authors:
Ron A. Oldfield; Sarala Arunagiri; Patricia J. Teller; Seetharami Seelam; Maria Ruiz Varela; Rolf Riesen; Philip C. Roth
Affiliations:
Sandia National Laboratories; The University of Texas at El Paso, USA; The University of Texas at El Paso, USA; IBM TJ Watson Research Center, USA; The University of Texas at El Paso, USA; Sandia National Laboratories; Oak Ridge National Laboratory
Venue:
MSST '07 Proceedings of the 24th IEEE Conference on Mass Storage Systems and Technologies
Year:
2007

Abstract

The next generation of capability-class, massively parallel processing (MPP) systems is expected to have hundreds of thousands of processors. For application-driven, periodic checkpoint operations, the state of the art does not provide a solution that scales to next-generation systems. We demonstrate this by using mathematical modeling to compute a lower bound on the impact of these approaches on the performance of applications executed on three massive-scale, in-production DOE systems and a theoretical petaflop system. We also adapt the model to investigate a proposed optimization that makes use of "lightweight" storage architectures and overlay networks to overcome the storage system bottleneck. Our results indicate that (1) as we approach the scale of next-generation systems, traditional checkpoint/restart approaches will increasingly impact application performance, accounting for over 50% of total application execution time; (2) although our alternative approach improves performance, it has limitations of its own; and (3) there is a critical need for new approaches to fault tolerance that allow continuous computing with minimal impact on application scalability.
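
The abstract does not give the model itself, so the following is only an illustrative sketch of how periodic checkpoint overhead can grow with system size. It is not the authors' model. It assumes independent node failures (system MTBF = node MTBF / node count), a checkpoint time equal to total memory divided by aggregate storage bandwidth, and Young's well-known first-order approximation for the checkpoint interval. All function names, parameters, and numeric values are hypothetical.

```python
import math


def system_mtbf(node_mtbf_s: float, nodes: int) -> float:
    """System MTBF in seconds, assuming independent node failures (assumption)."""
    return node_mtbf_s / nodes


def checkpoint_time(bytes_per_node: float, nodes: int, io_bytes_per_s: float) -> float:
    """Seconds to write one checkpoint if storage bandwidth is the bottleneck."""
    return bytes_per_node * nodes / io_bytes_per_s


def young_interval(delta_s: float, mtbf_s: float) -> float:
    """Young's first-order checkpoint interval: sqrt(2 * delta * MTBF)."""
    return math.sqrt(2.0 * delta_s * mtbf_s)


def overhead_fraction(delta_s: float, mtbf_s: float) -> float:
    """First-order fraction of time lost to checkpoint writes plus rework.

    Only meaningful when delta_s is much smaller than mtbf_s; the result is
    clamped to 1.0 because the approximation breaks down beyond that point.
    """
    tau = young_interval(delta_s, mtbf_s)
    return min(1.0, delta_s / tau + tau / (2.0 * mtbf_s))


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show the trend with node count.
    node_mtbf = 5 * 365 * 24 * 3600.0   # 5 years per node, in seconds
    mem_per_node = 2e9                  # 2 GB of state per node
    bandwidth = 50e9                    # 50 GB/s aggregate storage bandwidth
    for n in (1_000, 10_000, 100_000):
        delta = checkpoint_time(mem_per_node, n, bandwidth)
        mtbf = system_mtbf(node_mtbf, n)
        print(n, round(delta, 1), round(mtbf, 1), round(overhead_fraction(delta, mtbf), 3))
```

In this sketch, checkpoint time rises and system MTBF falls as the node count grows, so the overhead fraction increases on both counts. That is the qualitative trend the abstract describes, though the numbers printed here carry no relation to the paper's results.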