Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery
IEEE Transactions on Dependable and Secure Computing
FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI
CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing
A large-scale study of failures in high-performance computing systems
DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
ZOID: I/O-forwarding infrastructure for petascale architectures
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
DataStager: scalable data staging services for petascale applications
Proceedings of the 18th ACM international symposium on High performance distributed computing
In-Memory Checkpointing for MPI Programs by XOR-Based Double-Erasure Codes
Proceedings of the 16th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Design and Evaluation of Multiple-Level Data Staging for Blue Gene Systems
IEEE Transactions on Parallel and Distributed Systems
Rebound: scalable checkpointing for coherent shared memory
Proceedings of the 38th annual international symposium on Computer architecture
FTI: high performance fault tolerance interface for hybrid systems
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
High-Bandwidth remote parallel i/o with the distributed memory filesystem MEMFS
EuroPVM/MPI'06 Proceedings of the 13th European PVM/MPI User's Group conference on Recent advances in parallel virtual machine and message passing interface
Integrated in-system storage architecture for high performance computing
Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers
Distributed Virtual Diskless Checkpointing: A Highly Fault Tolerant Scheme for Virtualized Clusters
IPDPSW '12 Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum
Design and modeling of a non-blocking checkpointing system
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Hi-index | 0.00 |
With the massive scale of high-performance computing systems, long-running scientific parallel applications periodically save the state of their execution to files called checkpoints to recover from system failures. Checkpoints are stored on external parallel file systems, but limited bandwidth makes this a time-consuming operation. Multilevel checkpointing systems, like the Scalable Checkpoint/Restart (SCR) library, alleviate this bottleneck by caching checkpoints in storage located close to the compute nodes. However, most large scale systems do not provide file storage on compute nodes, preventing the use of SCR. We have implemented a novel user-space file system that stores data in main memory and transparently spills over to other storage, like local flash memory or the parallel file system, as needed. This technique extends the reach of libraries like SCR to systems where they otherwise could not be used. Furthermore, we expose file contents for Remote Direct Memory Access, allowing external tools to copy checkpoints to the parallel file system in the background with reduced CPU interruption. Our file system scales linearly with node count and delivers a 1~PB/s throughput at three million MPI processes, which is 20x faster than the system RAM disk and 1000x faster than the parallel file system.