The scalability of future Massively Parallel Processing (MPP) systems is severely challenged by high failure rates. Current centralized Hard Disk Drive (HDD) checkpointing incurs an overhead of 25% or more at petascale. Because systems become more failure-prone as node counts grow, novel techniques that enable fast and frequent checkpointing are critical to implementing future exascale systems. In this work, we first introduce an emerging nonvolatile memory technology, Phase-Change Random Access Memory (PCRAM), as a suitable candidate for a fast checkpointing device. After a thorough analysis of MPP systems, their failure rates, and their failure sources, we propose a PCRAM-based hybrid local/global checkpointing mechanism that not only provides faster checkpoint storage but also boosts the effectiveness of orthogonal techniques such as incremental checkpointing and background checkpointing. We design three variant implementations of PCRAM-based hybrid checkpointing, intended for adoption at different stages, to offer a smooth transition from conventional in-disk checkpointing to the instant in-memory approach. Analyzing the overhead with a hybrid checkpointing performance model, we show that the proposed approach incurs less than 3% performance overhead on a projected exascale system.