Evaluating energy savings for checkpoint/restart

Authors:
Bryan Mills;Ryan E. Grant;Kurt B. Ferreira;Rolf Riesen
Affiliations:
University of Pittsburgh;Sandia National Laboratories;Sandia National Laboratories;IBM Research - Ireland
Venue:
E2SC '13 Proceedings of the 1st International Workshop on Energy Efficient Supercomputing
Year:
2013

Abstract

The U.S. Department of Energy has identified resilience and energy consumption as key challenges for future extreme-scale systems. All checkpoint/restart methods require I/O to local or remote storage. Efforts are under way to minimize the amount of data movement and increase scalability. Nevertheless, the energy consumed by fault resilience methods will increase with system size. It is therefore important to understand the performance overhead in conjunction with the energy consumption of each fault resilience method. In this paper, we explore throttling CPU power consumption during I/O-intensive checkpoint operations of real applications. We find that 10% total energy savings are possible with little impact on application time to solution.
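
The trade-off described in the abstract can be illustrated with a back-of-the-envelope energy model. The sketch below is not the authors' methodology. It assumes a single node whose run splits into a compute phase and a checkpoint phase, each drawing constant power, and that throttling lowers power only during checkpoints. The function names, the `slowdown` parameter, and every number in the usage example are hypothetical placeholders, not measurements from the paper.

```python
def total_energy(compute_s, ckpt_s, compute_w, ckpt_w):
    """Energy in joules: power (W) times time (s), summed over both phases."""
    return compute_s * compute_w + ckpt_s * ckpt_w


def relative_savings(compute_s, ckpt_s, compute_w,
                     ckpt_w_full, ckpt_w_throttled, slowdown=1.0):
    """Fraction of total energy saved by throttling the CPU during checkpoints.

    slowdown >= 1.0 stretches checkpoint time to model any I/O slowdown
    caused by the lower CPU frequency (an assumption of this sketch).
    """
    baseline = total_energy(compute_s, ckpt_s, compute_w, ckpt_w_full)
    throttled = total_energy(compute_s, ckpt_s * slowdown,
                             compute_w, ckpt_w_throttled)
    return (baseline - throttled) / baseline


if __name__ == "__main__":
    # Hypothetical inputs: 800 s of compute, 200 s of checkpoint I/O,
    # 100 W at full speed, 60 W while throttled.
    no_slowdown = relative_savings(800, 200, 100, 100, 60)
    with_slowdown = relative_savings(800, 200, 100, 100, 60, slowdown=1.25)
    print(f"savings without slowdown: {no_slowdown:.1%}")
    print(f"savings with 25% longer checkpoints: {with_slowdown:.1%}")
```

Under these assumptions, the savings grow with the share of run time spent checkpointing and shrink if throttling lengthens the checkpoint, which is why energy and time to solution have to be evaluated together.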