Distributed Diskless Checkpoint for Large Scale Systems

Authors:
Leonardo Arturo Bautista Gomez;Naoya Maruyama;Franck Cappello;Satoshi Matsuoka
Affiliations:
-;-;-;-
Venue:
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Year:
2010

Citing 12
Cited 8

Diskless Checkpointing

IEEE Transactions on Parallel and Distributed Systems
A large-scale study of failures in high-performance computing systems

DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Scalable diskless checkpointing for large parallel systems
Optimizing Cauchy Reed-Solomon Codes for Fault-Tolerant Network Storage Applications

NCA '06 Proceedings of the Fifth IEEE International Symposium on Network Computing and Applications
Libckpt: transparent checkpointing under Unix

TCON'95 Proceedings of the USENIX 1995 Technical Conference Proceedings
Group-based Coordinated Checkpointing for MPI: A Case Study on InfiniBand

ICPP '07 Proceedings of the 2007 International Conference on Parallel Processing
A Scalable Checkpoint Encoding Algorithm for Diskless Checkpointing

HASE '08 Proceedings of the 2008 11th IEEE High Assurance Systems Engineering Symposium
A performance evaluation and examination of open-source erasure coding libraries for storage

FAST '09 Proceedings of the 7th conference on File and storage technologies
DRAM errors in the wild: a large-scale field study

Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
Fault Tolerance in Petascale/Exascale Systems: Current Knowledge, Challenges and Research Opportunities

International Journal of High Performance Computing Applications
PLFS: a checkpoint filesystem for parallel applications

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis

High performance Linpack benchmark: a fault tolerant implementation without checkpointing

Proceedings of the international conference on Supercomputing
Exascale algorithms for generalized MPI_comm_split

EuroMPI'11 Proceedings of the 18th European MPI Users' Group conference on Recent advances in the message passing interface
FTI: high performance fault tolerance interface for hybrid systems

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
McrEngine: a scalable checkpointing system using data-aware aggregation and compression

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Scalable Reed-Solomon-based reliable local storage for HPC applications on IaaS clouds

Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
Energy-aware I/O optimization for checkpoint and restart on a NAND flash memory system

Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale
Parallel reduction to Hessenberg form with algorithm-based fault tolerance

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
McrEngine: A scalable checkpointing system using data-aware aggregation and compression

Scientific Programming - Selected Papers from Super Computing 2012

Quantified Score

Hi-index	0.00

Abstract

In high performance computing (HPC), applications are periodically checkpointed to stable storage to increase the success rate of long executions. Today, the overhead imposed by disk-based checkpointing is about 20% of execution time, and in the coming years it will exceed 50% if the checkpoint frequency increases as the fault frequency increases. Diskless checkpointing has been introduced as a solution to avoid the I/O bottleneck of disk-based checkpointing. However, the encoding time, the dedicated resources (the spares), and the memory overhead imposed by diskless checkpointing are significant obstacles to its adoption. In this work, we address these three limitations: 1) we propose a fault-tolerant model able to tolerate up to 50% of process failures with a low checkpointing overhead; 2) our fault tolerance model works without spare nodes, while still guaranteeing high reliability; 3) we use solid state drives to significantly increase checkpoint performance and avoid the memory overhead of classic diskless checkpointing.
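
As a rough illustration of the encoding step that diskless checkpointing relies on, the sketch below computes a single XOR parity block over a group of in-memory checkpoints and uses it to rebuild one lost checkpoint. This is an assumption made for illustration only and is not the scheme proposed in the paper: XOR parity tolerates one failure per group, whereas tolerating up to 50% of process failures requires a stronger erasure code. All function names and the sample data are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode_parity(checkpoints):
    """Compute one parity block over a group of in-memory checkpoints."""
    if len({len(c) for c in checkpoints}) != 1:
        raise ValueError("checkpoints must all have the same length")
    return xor_blocks(checkpoints)


def recover(survivors, parity):
    """Rebuild the single missing checkpoint from the survivors and the parity."""
    return xor_blocks(survivors + [parity])


# Hypothetical group of four processes, each holding a small checkpoint.
checkpoints = [b"rank0-state", b"rank1-state", b"rank2-state", b"rank3-state"]
parity = encode_parity(checkpoints)

# Simulate the loss of process 2 and rebuild its checkpoint.
lost = 2
survivors = checkpoints[:lost] + checkpoints[lost + 1:]
restored = recover(survivors, parity)
```

In this sketch the parity block would have to live on some node other than the one that fails; where encoded data is placed, and how the encoding cost scales with group size, are exactly the concerns the abstract raises.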