Recovery in distributed systems using asynchronous message logging and checkpointing
PODC '88 Proceedings of the seventh annual ACM Symposium on Principles of distributed computing
Fast parallel algorithms for short-range molecular dynamics
Journal of Computational Physics
Distributed snapshots: determining global states of distributed systems
ACM Transactions on Computer Systems (TOCS)
A survey of rollback-recovery protocols in message-passing systems
ACM Computing Surveys (CSUR)
Message Logging: Pessimistic, Optimistic, Causal, and Optimal
IEEE Transactions on Software Engineering
MPICH-V: toward a scalable fault tolerant MPI for volatile nodes
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
ACM Computing Surveys (CSUR)
Improved message logging versus improved coordinated checkpointing for fault tolerant MPI
CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing
A large-scale study of failures in high-performance computing systems
DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Modeling the Impact of Checkpoints on Next-Generation Systems
MSST '07 Proceedings of the 24th IEEE Conference on Mass Storage Systems and Technologies
2-step algorithm for enhancing effectiveness of sender-based message logging
SpringSim '07 Proceedings of the 2007 spring simulation multiconference - Volume 2
Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic MPI Applications
IPDPS '11 Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium
Evaluating the viability of process replication reliability for exascale systems
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
On the viability of checkpoint compression for extreme scale fault tolerance
Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
Hi-index | 0.00 |
Fault-tolerance and its associated overheads are of great concern for current and future extreme-scale systems. The dominant mechanism used today, coordinated checkpoint/restart, places great demands on the I/O system and the method requires frequent synchronization. Uncoordinated checkpointing with message logging addresses many of these limitations at the cost of increasing the storage needed to hold message logs. These storage requirements are critical to the scalability of extreme-scale systems. In this paper, we investigate the viability of using standard compression algorithms to reduce message log sizes for a number of key high-performance computing workloads. Using these workloads we show that, while not be a universal solution for all applications, compression has the potential to significantly reduce message log sizes for a great number of important workloads.