Keeping checkpointing viable for exascale systems
The U.S. Department of Energy has identified resilience and energy consumption as key challenges for future extreme-scale systems. All checkpoint/restart methods require I/O to local or remote storage. Efforts are under way to minimize data movement and increase scalability. Nevertheless, the energy consumed by fault resilience methods will increase with system size. It is therefore important to understand the performance overhead of each fault resilience method in conjunction with its energy consumption. In this paper we explore throttling CPU power consumption during I/O-intensive checkpoint operations of real applications. We find that total energy savings of 10% are possible with little impact on application time to solution.
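The trade-off described above can be illustrated with a minimal energy model. This is a sketch under stated assumptions, not the method used in the paper: it assumes a run splits into a compute phase and a checkpoint phase, each with a constant average node power, and that throttling lowers checkpoint-phase power while possibly stretching checkpoint time. The function names, the `slowdown` parameter, and all numbers in the usage note are hypothetical.

```python
def total_energy(compute_s, ckpt_s, compute_w, ckpt_w):
    """Energy in joules for a compute phase plus a checkpoint phase.

    Assumes constant average power (watts) within each phase.
    """
    return compute_s * compute_w + ckpt_s * ckpt_w


def relative_savings(compute_s, ckpt_s, compute_w, ckpt_w,
                     throttled_w, slowdown=1.0):
    """Fraction of total energy saved by throttling during checkpoints.

    throttled_w: assumed average power while checkpointing with the CPU throttled.
    slowdown:    assumed factor by which throttling lengthens the checkpoint
                 phase (1.0 means no slowdown).
    """
    baseline = total_energy(compute_s, ckpt_s, compute_w, ckpt_w)
    throttled = total_energy(compute_s, ckpt_s * slowdown,
                             compute_w, throttled_w)
    return (baseline - throttled) / baseline


if __name__ == "__main__":
    # Hypothetical inputs chosen only to exercise the model.
    print(relative_savings(900, 100, 200, 200, 100))
```

With these made-up inputs, throttling pays off only while the power reduction outweighs any lengthening of the checkpoint phase: halving checkpoint power while doubling checkpoint time leaves total energy unchanged in this model.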