Optimistic recovery in distributed systems
ACM Transactions on Computer Systems (TOCS)
Time, clocks, and the ordering of events in a distributed system
Communications of the ACM
A survey of rollback-recovery protocols in message-passing systems
ACM Computing Surveys (CSUR)
Message Logging: Pessimistic, Optimistic, Causal, and Optimal
IEEE Transactions on Software Engineering
How to recover efficiently and asynchronously when optimism fails
ICDCS '96 Proceedings of the 16th International Conference on Distributed Computing Systems (ICDCS '96)
MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Impact of Event Logger on Causal Message Logging Protocols for Fault Tolerant MPI
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
Improved message logging versus improved coordinated checkpointing for fault tolerant MPI
CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing
A large-scale study of failures in high-performance computing systems
DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Active Optimistic Message Logging for Reliable Execution of MPI Applications
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
An agent framework for agent societies
Proceedings of the compilation of the co-located workshops on DSM'11, TMC'11, AGERE!'11, AOOPES'11, NEAT'11, & VMIL'11
Hi-index | 0.00 |
Message logging is an attractive solution to provide fault tolerance for message passing applications because it is more scalable than coordinated checkpointing. Sender-based message logging is a well known optimization that allows to save messages payload in the sender memory and so only the events corresponding to message receptions have to be logged reliably using an event logger. In existing work on message logging, the event logger has always been considered as a centralized process, limiting message logging protocols scalability. In this paper, we propose a distributed event logger. This new event logger takes advantage of multi-cores processors to be executed in parallel with application processes. It makes use of the nodes' volatile memory to save events reliably. We propose a simple gossip-based dissemination protocol to make application processes aware of new stable events. We evaluated our distributed event logger in the Open MPI library with an optimistic and a pessimistic message logging protocol. Experiments show that distributed event logging improves message logging protocols scalability.