A communication-induced checkpointing and asynchronous recovery algorithm for multithreaded distributed systems

Authors:
Tongchit Tantikul; D. Manivannan
Affiliations:
Computer Science Department, University of Kentucky, Lexington, KY; Computer Science Department, University of Kentucky, Lexington, KY
Venue:
PDCAT'04: Proceedings of the 5th International Conference on Parallel and Distributed Computing: Applications and Technologies
Year:
2004

Abstract

Checkpointing and recovery in traditional distributed systems are relatively well established. However, checkpointing and recovery in multithreaded distributed systems have not been studied in the literature. Applying traditional checkpointing and recovery algorithms to multithreaded systems leads to the false causality problem and high checkpointing overhead. In the proposed algorithm, checkpointing is implemented at the process level to reduce the number of checkpoints, and recovery is implemented at the thread level to minimize false causality. The algorithm also takes advantage of communication-induced checkpointing to reduce message overhead.
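
The abstract does not spell out the checkpointing rule, so the following is only a minimal sketch of the general idea behind communication-induced checkpointing taken at the process level. It assumes a sequence-number scheme in which each message piggybacks the sender's checkpoint sequence number and the receiver takes a forced checkpoint before delivering a message that carries a higher number. The names `Process`, `Message`, `basic_checkpoint`, and `sender_sn` are hypothetical, and the sketch is not the paper's algorithm; in particular, thread-level recovery is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    payload: str
    sender_sn: int  # sender's checkpoint sequence number, piggybacked on the message


@dataclass
class Process:
    name: str
    sn: int = 0  # sequence number of the latest process-level checkpoint
    state: list = field(default_factory=list)  # stands in for the state of all threads
    checkpoints: dict = field(default_factory=dict)  # sequence number -> saved state copy

    def __post_init__(self):
        # Every process starts with an initial checkpoint numbered 0.
        self.checkpoints[self.sn] = list(self.state)

    def basic_checkpoint(self):
        # Checkpoint taken on the process's own schedule (assumed policy).
        self.sn += 1
        self.checkpoints[self.sn] = list(self.state)

    def send(self, payload):
        # No separate coordination messages: the sequence number rides on the message.
        return Message(payload, self.sn)

    def receive(self, msg):
        if msg.sender_sn > self.sn:
            # Forced checkpoint BEFORE delivery, so the saved state does not
            # reflect a message sent after the sender's matching checkpoint.
            self.sn = msg.sender_sn
            self.checkpoints[self.sn] = list(self.state)
        self.state.append(msg.payload)


p, q = Process("P"), Process("Q")
p.basic_checkpoint()          # P advances to sequence number 1
q.receive(p.send("hello"))    # Q is forced to checkpoint before delivering
```

Because one checkpoint covers the whole process rather than each thread, the number of checkpoints stays small in this sketch, and no extra control messages are exchanged since the only coordination data is the integer carried on application messages.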