Checkpointing and Rollback-Recovery for Distributed Systems
IEEE Transactions on Software Engineering - Special issue on distributed systems
A survey of rollback-recovery protocols in message-passing systems
ACM Computing Surveys (CSUR)
Message Logging in Mobile Computing
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
Publishing: a reliable broadcast communication mechanism
SOSP '83 Proceedings of the ninth ACM symposium on Operating systems principles
Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations
HPDC '99 Proceedings of the 8th IEEE International Symposium on High Performance Distributed Computing
Sender-based message logging for reducing rollback propagation
SPDP '95 Proceedings of the 7th IEEE Symposium on Parallel and Distributed Processing
Future Generation Computer Systems - Special issue: Advanced services for clusters and internet computing
MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Checkpointing and communication pattern-neutral algorithm for removing messages logged by senders
HPCC'06 Proceedings of the Second international conference on High Performance Computing and Communications
To keep logging messages in the limited volatile memory of their sending processes, existing sender-based message logging (SBML) protocols either force the processes to periodically flush the message log to stable storage, or force the messages in the log to become useless for future failures and then remove them. However, these garbage collection algorithms may incur a large number of stable storage accesses or high communication and checkpointing overheads as the inter-process communication rate increases. To address this problem, we propose an efficient algorithm that lets each process autonomously remove useless log information from its volatile storage by piggybacking only a small amount of additional information on application messages. It requires neither extra messages nor forced checkpoints. Additionally, the algorithm efficiently supports fast commit of all output to the outside world. Simulation results show that our algorithm considerably outperforms the traditional algorithm in terms of the average elapsed time until a process's message-logging memory buffer becomes full.
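The general idea of piggyback-driven log cleanup can be illustrated with a minimal sketch. This is not the paper's algorithm, whose details are not given in the abstract. It assumes FIFO channels, per-pair send sequence numbers, and a single piggybacked field; all names (`Process`, `Message`, `ckpt_covered_ssn`, and so on) are hypothetical. The assumption is that once a receiver's latest checkpoint covers a message, the sender no longer needs its logged copy for recovery.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    sender: int
    receiver: int
    ssn: int       # send sequence number, per (sender, receiver) pair
    payload: str
    # Piggybacked field (assumed): highest ssn received FROM `receiver`
    # that the sender's latest checkpoint already covers.
    ckpt_covered_ssn: int = 0


@dataclass
class Process:
    pid: int
    next_ssn: dict = field(default_factory=dict)       # receiver -> last ssn used
    send_log: dict = field(default_factory=dict)       # receiver -> {ssn: payload}
    delivered_ssn: dict = field(default_factory=dict)  # sender -> highest ssn delivered
    ckpt_covered: dict = field(default_factory=dict)   # sender -> highest ssn in last checkpoint

    def send(self, receiver: int, payload: str) -> Message:
        ssn = self.next_ssn.get(receiver, 0) + 1
        self.next_ssn[receiver] = ssn
        # Sender-based logging: keep a copy in volatile memory.
        self.send_log.setdefault(receiver, {})[ssn] = payload
        # Piggyback what our last checkpoint covers for this peer.
        return Message(self.pid, receiver, ssn, payload,
                       self.ckpt_covered.get(receiver, 0))

    def receive(self, msg: Message) -> None:
        prev = self.delivered_ssn.get(msg.sender, 0)
        self.delivered_ssn[msg.sender] = max(prev, msg.ssn)
        # Autonomous cleanup: drop logged messages that the peer's
        # checkpoint already covers. No extra message is exchanged.
        log = self.send_log.get(msg.sender, {})
        for ssn in [s for s in log if s <= msg.ckpt_covered_ssn]:
            del log[ssn]

    def checkpoint(self) -> None:
        # A local (unforced) checkpoint records what has been delivered so far.
        self.ckpt_covered = dict(self.delivered_ssn)

    def log_size(self) -> int:
        return sum(len(entries) for entries in self.send_log.values())


p, q = Process(0), Process(1)
q.receive(p.send(1, "a"))
q.receive(p.send(1, "b"))
q.checkpoint()              # q's checkpoint now covers ssn 1 and 2 from p
q.receive(p.send(1, "c"))   # delivered after the checkpoint, not covered
p.receive(q.send(0, "ack")) # ordinary message carrying ckpt_covered_ssn = 2
```

In this sketch, `p` discards its logged copies of the first two messages only when an ordinary application message from `q` arrives, so the cleanup costs one extra integer per message rather than a stable storage access or a forced checkpoint.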