Hybrid checkpointing using emerging nonvolatile memories for future exascale systems

Authors:
Xiangyu Dong;Yuan Xie;Naveen Muralimanohar;Norman P. Jouppi
Affiliations:
Pennsylvania State University;Pennsylvania State University;Hewlett-Packard Labs;Hewlett-Packard Labs
Venue:
ACM Transactions on Architecture and Code Optimization (TACO)
Year:
2011

Abstract

High failure rates severely challenge the scalability of future Massively Parallel Processing (MPP) systems. Current centralized Hard Disk Drive (HDD) checkpointing incurs an overhead of 25% or more at petascale. Because systems become more vulnerable as node counts grow, novel techniques that enable fast and frequent checkpointing are critical to implementing future exascale systems. In this work, we first introduce an emerging nonvolatile memory technology, Phase-Change Random Access Memory (PCRAM), as a suitable candidate for a fast checkpointing device. After a thorough analysis of MPP systems, their failure rates, and their failure sources, we propose a PCRAM-based hybrid local/global checkpointing mechanism that not only provides faster checkpoint storage but also boosts the effectiveness of orthogonal techniques such as incremental checkpointing and background checkpointing. We design three variant implementations of PCRAM-based hybrid checkpointing, intended for adoption at different stages, to offer a smooth transition from conventional in-disk checkpointing to the instant in-memory approach. Using a hybrid checkpointing performance model to analyze the overhead, we show that the proposed approach incurs less than 3% performance overhead on a projected exascale system.
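
To give a rough feel for why faster checkpoint storage matters, the sketch below uses Young's classic first-order approximation of the optimum checkpoint interval, tau = sqrt(2 * delta * M), where delta is the time to write one checkpoint and M is the system mean time between failures. This is not the hybrid local/global performance model referred to in the abstract; it is a simpler single-level stand-in, valid only when delta is much smaller than M. The function names and the numbers in the demo are illustrative assumptions, not values from the paper.

```python
import math


def young_interval(checkpoint_time_s: float, mtbf_s: float) -> float:
    """Young's first-order optimum compute interval between checkpoints.

    Assumes checkpoint_time_s << mtbf_s (first-order approximation).
    """
    return math.sqrt(2.0 * checkpoint_time_s * mtbf_s)


def waste_fraction(checkpoint_time_s: float, mtbf_s: float) -> float:
    """First-order fraction of time lost to checkpointing plus rework.

    Two terms: time spent writing checkpoints (delta / tau) and the
    expected recomputation after a failure (tau / (2 * M)).
    Restart time is ignored in this simplified sketch.
    """
    tau = young_interval(checkpoint_time_s, mtbf_s)
    return checkpoint_time_s / tau + tau / (2.0 * mtbf_s)


if __name__ == "__main__":
    # Hypothetical inputs for illustration only (not taken from the paper).
    mtbf_s = 3600.0
    for label, delta_s in [("slow global checkpoint", 300.0),
                           ("fast local checkpoint", 1.0)]:
        tau = young_interval(delta_s, mtbf_s)
        waste = waste_fraction(delta_s, mtbf_s)
        print(f"{label}: interval ~ {tau:.0f} s, waste ~ {waste:.1%}")
```

Under this approximation the waste at the optimum interval reduces to sqrt(2 * delta / M), so cutting the checkpoint write time by a factor of k reduces the estimated overhead by roughly sqrt(k). That is the qualitative trend the abstract relies on when it argues for a faster checkpoint device.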