Exploring reliability of exascale systems through simulations

Authors:
Dongfang Zhao;Da Zhang;Ke Wang;Ioan Raicu
Affiliations:
Illinois Institute of Technology, Chicago, IL;Illinois Institute of Technology, Chicago, IL;Illinois Institute of Technology, Chicago, IL;Illinois Institute of Technology, Chicago, IL and Argonne National Laboratory, Argonne, IL
Venue:
Proceedings of the High Performance Computing Symposium
Year:
2013

Abstract

Exascale computers are predicted to emerge by the end of this decade, with millions of nodes and billions of concurrent cores/threads. One of the most critical challenges for exascale computing is maintaining system reliability effectively and efficiently. Checkpointing is the state-of-the-art technique for reliability in high-end computing systems and has proved to work well at current petascale levels. This paper investigates the suitability of the checkpointing mechanism for exascale computers, on both parallel filesystems and distributed filesystems. We built a model to emulate exascale systems and developed a simulator, RXSim, to study their reliability and efficiency. Experiments show that, with checkpointing on parallel filesystems, overall system efficiency and availability tend towards zero as system scale approaches exascale. However, the simulations suggest that a distributed filesystem with local persistent storage would offer excellent scalability and aggregate bandwidth, enabling efficient checkpointing at exascale.
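The scaling argument in the abstract can be illustrated with a back-of-the-envelope estimate. The sketch below is not RXSim and not the authors' model. It assumes independent node failures, Young's first-order approximation for the checkpoint interval, and entirely hypothetical parameter values (node MTBF, checkpoint size, bandwidths). The function names `efficiency`, `shared_fs_bandwidth`, and `local_fs_bandwidth` are introduced here for illustration only.

```python
import math

# All constants below are hypothetical, chosen only to illustrate the trend.
NODE_MTBF_S = 10 * 365 * 86400      # assumed mean time between failures of one node (10 years)
CKPT_BYTES_PER_NODE = 8e9           # assumed checkpoint size per node (8 GB)
SHARED_FS_BW = 1e12                 # assumed fixed aggregate bandwidth of a parallel filesystem (1 TB/s)
LOCAL_DISK_BW = 2e9                 # assumed per-node local storage bandwidth (2 GB/s)


def shared_fs_bandwidth(nodes):
    """Aggregate bandwidth of a shared parallel filesystem: fixed, independent of node count."""
    return SHARED_FS_BW


def local_fs_bandwidth(nodes):
    """Aggregate bandwidth of node-local storage: grows linearly with node count."""
    return nodes * LOCAL_DISK_BW


def efficiency(nodes, bandwidth_fn, restart_s=0.0):
    """First-order estimate of the fraction of wall-clock time spent on useful work."""
    system_mtbf = NODE_MTBF_S / nodes                           # independent failures assumed
    ckpt_s = nodes * CKPT_BYTES_PER_NODE / bandwidth_fn(nodes)  # time to write one global checkpoint
    interval = math.sqrt(2.0 * ckpt_s * system_mtbf)            # Young's approximation
    # Overhead: checkpoint cost per interval, plus expected rework and restart per failure.
    waste = ckpt_s / interval + (interval / 2.0 + restart_s) / system_mtbf
    # The approximation breaks down when ckpt_s is not much smaller than system_mtbf;
    # clamp to 0 to mean "no useful progress".
    return max(0.0, 1.0 - waste)


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000, 1_000_000):
        print(n, round(efficiency(n, shared_fs_bandwidth), 3),
              round(efficiency(n, local_fs_bandwidth), 3))
```

Under these assumptions the checkpoint time on the shared filesystem grows with node count while the system MTBF shrinks, so the estimated efficiency collapses. With node-local storage the checkpoint time stays constant and only the shrinking MTBF erodes efficiency.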