A first order approximation to the optimum checkpoint interval
Communications of the ACM
Evaluation of checkpoint mechanisms for massively parallel machines
FTCS '96 Proceedings of the The Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS '96)
Stable Checkpointing in Distributed Systems without Shared Disks
IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
A higher order estimate of the optimum checkpoint interval for restart dumps
Future Generation Computer Systems
Design space exploration for 3D architectures
ACM Journal on Emerging Technologies in Computing Systems (JETC)
Modeling the Impact of Checkpoints on Next-Generation Systems
MSST '07 Proceedings of the 24th IEEE Conference on Mass Storage Systems and Technologies
Compiler-enhanced incremental checkpointing for OpenMP applications
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Corona: System Implications of Emerging Nanophotonic Technology
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Accurate, Pre-RTL Temperature-Aware Design Using a Parameterized, Geometric Thermal Model
IEEE Transactions on Computers
Storage-class memory: the next storage system technology
IBM Journal of Research and Development
PowerNap: eliminating server idle power
Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Cooperative checkpointing theory
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Emerging technologies and their impact on system design
Proceedings of the 14th ACM/IEEE international symposium on Low power electronics and design
Phase change memory architecture and the quest for scalability
Communications of the ACM
Distributed Diskless Checkpoint for Large Scale Systems
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Hybrid checkpointing using emerging nonvolatile memories for future exascale systems
ACM Transactions on Architecture and Code Optimization (TACO)
Enabling architectural innovations using non-volatile memory
Proceedings of the 21st edition of the great lakes symposium on Great lakes symposium on VLSI
Page placement in hybrid memory systems
Proceedings of the international conference on Supercomputing
FTI: high performance fault tolerance interface for hybrid systems
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
System implications of memory reliability in exascale computing
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Using active NVRAM for I/O staging
Proceedings of the 2nd international workshop on Petascal data analytics: challenges and opportunities
Proceedings of the 26th ACM international conference on Supercomputing
Proceedings of the 39th Annual International Symposium on Computer Architecture
NV-process: a fault-tolerance process model based on non-volatile memory
Proceedings of the Asia-Pacific Workshop on Systems
NV-process: a fault-tolerance process model based on non-volatile memory
APSys'12 Proceedings of the Third ACM SIGOPS Asia-Pacific conference on Systems
Alleviating scalability issues of checkpointing protocols
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
MAGE: adaptive granularity and ECC for resilient and power efficient memory systems
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Containment domains: a scalable, efficient, and flexible resilience scheme for exascale systems
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Leveraging phase change memory to achieve efficient virtual machine execution
Proceedings of the 9th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments
Practical nonvolatile multilevel-cell phase change memory
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Evaluating energy savings for checkpoint/restart
E2SC '13 Proceedings of the 1st International Workshop on Energy Efficient Supercomputing
Accelerating incremental checkpointing for extreme-scale computing
Future Generation Computer Systems
Containment domains: A scalable, efficient and flexible resilience scheme for exascale systems
Scientific Programming - Selected Papers from Super Computing 2012
Hi-index | 0.02 |
The scalability of future massively parallel processing (MPP) systems is challenged by high failure rates. Current hard disk drive (HDD) checkpointing results in overhead of 25% or more at the petascale. With a direct correlation between checkpoint frequencies and node counts, novel techniques that can take more frequent checkpoints with minimum overhead are critical to implement a reliable exascale system. In this work, we leverage the upcoming Phase-Change Random Access Memory (PCRAM) technology and propose a hybrid local/global checkpointing mechanism after a thorough analysis of MPP systems failure rates and failure sources. We propose three variants of PCRAM-based hybrid checkpointing schemes, DIMM+HDD, DIMM+DIMM, and 3D+3D, to reduce the checkpoint overhead and offer a smooth transition from the conventional pure HDD checkpoint to the ideal 3D PCRAM mechanism. The proposed pure 3D PCRAM-based mechanism can ultimately take checkpoints with overhead less than 4% on a projected exascale system.