The viability of using compression to decrease message log sizes

Authors:
Kurt B. Ferreira;Rolf Riesen;Dorian Arnold;Dewan Ibtesham;Ron Brightwell
Affiliations:
Sandia National Laboratories, Albuquerque, NM;IBM Research, Ireland;University of New Mexico, Albuquerque, NM;University of New Mexico, Albuquerque, NM;Sandia National Laboratories, Albuquerque, NM
Venue:
Euro-Par'12 Proceedings of the 18th international conference on Parallel processing workshops
Year:
2012

Abstract

Fault tolerance and its associated overheads are of great concern for current and future extreme-scale systems. The dominant mechanism used today, coordinated checkpoint/restart, places great demands on the I/O system and requires frequent synchronization. Uncoordinated checkpointing with message logging addresses many of these limitations at the cost of increasing the storage needed to hold message logs. These storage requirements are critical to the scalability of extreme-scale systems. In this paper, we investigate the viability of using standard compression algorithms to reduce message log sizes for a number of key high-performance computing workloads. Using these workloads, we show that, while not a universal solution for all applications, compression has the potential to significantly reduce message log sizes for a great number of important workloads.
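As a rough illustration of the kind of measurement the abstract describes, the sketch below compresses a synthetic message log with a standard algorithm and reports the fraction of space saved. It is not the authors' methodology: the choice of zlib, the `compression_factor` helper, and the synthetic payloads are all assumptions made for this example, and real application traffic may compress far less well.

```python
import struct
import zlib


def compression_factor(log: bytes, level: int = 6) -> float:
    """Fraction of space saved by compression: 1 - compressed/uncompressed.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not log:
        return 0.0
    compressed = zlib.compress(log, level)
    return 1.0 - len(compressed) / len(log)


# Assumed stand-in for a sender-side message log: 1000 payloads, each
# holding 64 doubles with very little variation (highly redundant data).
messages = [struct.pack("64d", *([float(i % 4)] * 64)) for i in range(1000)]
log = b"".join(messages)

factor = compression_factor(log)
print(f"raw log: {len(log)} bytes, space saved: {factor:.1%}")
```

Running the same function over logs captured from different applications would show how much the benefit varies by workload, which is the question the paper studies.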