FTI: high performance fault tolerance interface for hybrid systems

Authors:
Leonardo Bautista-Gomez;Seiji Tsuboi;Dimitri Komatitsch;Franck Cappello;Naoya Maruyama;Satoshi Matsuoka
Affiliations:
Tokyo Institute of Technology, INRIA;JAMSTEC;University of Toulouse;INRIA, University of Illinois;Tokyo Institute of Technology;Tokyo Institute of Technology
Venue:
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Year:
2011

Abstract

Large scientific applications deployed on current petascale systems spend a significant share of their execution time writing checkpoint files to remote storage. New fault-tolerance techniques will be critical to exploit post-petascale systems efficiently. In this work, we propose a low-overhead, high-frequency multi-level checkpoint technique that integrates a highly reliable, topology-aware Reed-Solomon encoding into a three-level checkpoint scheme. We efficiently hide the encoding time by using one dedicated fault-tolerance thread per node. We implement our technique in the Fault Tolerance Interface (FTI). We evaluate the correctness of our performance model and study the reliability of our library. To demonstrate the performance of FTI, we present a case study of the Mw 9.0 Tohoku, Japan earthquake simulation with SPECFEM3D on TSUBAME2.0. We demonstrate a checkpoint overhead as low as 8% on sustained 0.1-petaflops runs (1152 GPUs) while checkpointing at high frequency.
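
To make the idea of encoding checkpoints across nodes more concrete, the following is a minimal sketch, not FTI's implementation. It swaps in single XOR parity for the Reed-Solomon encoding named in the abstract, so it tolerates only one lost checkpoint per group, whereas Reed-Solomon can tolerate several. The function names (`xor_blocks`, `encode_group`, `recover`) and the idea of a fixed "group" of node-local checkpoints are assumptions made for illustration only.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode_group(checkpoints):
    """Compute one parity block for a group of node-local checkpoints.

    Checkpoints are zero-padded to a common length before encoding.
    (Hypothetical helper; a stand-in for Reed-Solomon encoding.)
    """
    size = max(len(c) for c in checkpoints)
    padded = [c.ljust(size, b"\0") for c in checkpoints]
    return xor_blocks(padded)


def recover(survivors, parity, lost_length):
    """Rebuild the single missing checkpoint from the survivors and parity."""
    size = len(parity)
    padded = [c.ljust(size, b"\0") for c in survivors]
    return xor_blocks(padded + [parity])[:lost_length]


# Three nodes in one encoding group; the node holding group[1] fails.
group = [b"node0-state", b"node1", b"node2-data!"]
parity = encode_group(group)
restored = recover([group[0], group[2]], parity, len(group[1]))
```

In this sketch, `restored` equals the lost checkpoint `group[1]`. If the parity block were computed in a separate thread while the application continues, the encoding cost would overlap with computation, which is the effect the abstract attributes to the dedicated fault-tolerance thread.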