On Coordinated Checkpointing in Distributed Systems
IEEE Transactions on Parallel and Distributed Systems
An Analysis of Communication-Induced Checkpointing
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
Nonblocking Checkpointing for Optimistic Parallel Simulation: Description and an Implementation
IEEE Transactions on Parallel and Distributed Systems
Reliability-Aware Approach: An Incremental Checkpoint/Restart Model in HPC Environments
CCGRID '08 Proceedings of the 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid
Evaluation of fault-tolerant policies using simulation
CLUSTER '07 Proceedings of the 2007 IEEE International Conference on Cluster Computing
In-Memory Checkpointing for MPI Programs by XOR-Based Double-Erasure Codes
Proceedings of the 16th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Software Challenges for Extreme Scale Computing: Going From Petascale to Exascale Systems
International Journal of High Performance Computing Applications
A model for predicting the optimum checkpoint interval for restart dumps
ICCS'03 Proceedings of the 2003 international conference on Computational science
Pageserver: High-Performance SSD-Based Checkpointing of Transactional Distributed Memory
ICCEA '10 Proceedings of the 2010 Second International Conference on Computer Engineering and Applications - Volume 01
Enhancing Checkpoint Performance with Staging IO and SSD
SNAPI '10 Proceedings of the 2010 International Workshop on Storage Network Architecture and Parallel I/Os
Hybrid checkpointing using emerging nonvolatile memories for future exascale systems
ACM Transactions on Architecture and Code Optimization (TACO)
GPFS: a shared-disk file system for large computing clusters
FAST'02 Proceedings of the 1st USENIX conference on File and storage technologies
A New Diskless Checkpointing Approach for Multiple Processor Failures
IEEE Transactions on Dependable and Secure Computing
Making a case for distributed file systems at Exascale
Proceedings of the third international workshop on Large-scale system and application performance
SimMatrix: SIMulator for MAny-Task computing execution fabRIc at eXascale
Proceedings of the High Performance Computing Symposium
Hi-index | 0.00 |
Exascale computers are predicted to emerge by the end of this decade with millions of nodes and billions of concurrent cores/threads. One of the most critical challenges for exascale computing is how to effectively and efficiently maintain the system reliability. Checkpointing is the state-of-the-art technique for high-end computing system reliability that has proved to work well for current petascale scales. This paper investigates the suitability of checkpointing mechanism for exascale computers, across both parallel filesystems and distributed filesystems. We built a model to emulate exascale systems, and developed a simulator, RXSim, to study its reliability and efficiency. Experiments show that the overall system efficiency and availability would go towards zero as system scales approach exascale with checkpointing mechanism on parallel filesystems. However, the simulations suggest that a distributed filesystem with local persistent storage would offer excellent scalability and aggregate bandwidth, enabling efficient checkpointing at exascale.