Existing algorithms for global snapshots in distributed systems do not scale when the underlying topology is complete. In a network with N processors, they require O(N) space and O(N) messages per processor, which makes them inefficient in large systems where the logical topology of the communication layer, such as MPI, is complete. In this paper, we propose three algorithms for global snapshots: a grid-based, a tree-based, and a centralized algorithm. The grid-based algorithm uses O(N) space but only O(√N) messages per processor. The tree-based algorithm requires only O(1) space and O(log N log w) messages per processor, where w is the average number of messages in transit per processor. The centralized algorithm requires only O(1) space and O(log w) messages per processor. We also give a matching lower bound for this problem. Our algorithms have applications in checkpointing, detecting stable predicates, and implementing synchronizers. We have implemented our algorithms on top of the MPI library on the Blue Gene/L supercomputer. Our experiments confirm that the proposed algorithms significantly reduce the message and space complexity of a global snapshot.
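To give a feel for the bounds quoted above, the following minimal Python sketch evaluates only the leading-order terms for a chosen N and w. It is an illustration, not the paper's algorithms: the function name `per_processor_costs` is hypothetical, every hidden constant is assumed to be 1, and logarithms are assumed to be base 2, so the numbers indicate growth rates rather than real message counts.

```python
import math


def per_processor_costs(n, w):
    """Leading-order per-processor costs taken from the stated bounds.

    Assumptions (not from the paper): all hidden constants are 1 and
    logarithms are base 2. n is the number of processors and w is the
    average number of messages in transit per processor.
    """
    return {
        "existing": {"space": n, "messages": n},
        "grid": {"space": n, "messages": math.sqrt(n)},
        "tree": {"space": 1, "messages": math.log2(n) * math.log2(w)},
        "centralized": {"space": 1, "messages": math.log2(w)},
    }


if __name__ == "__main__":
    # Example values chosen only for illustration.
    for name, cost in per_processor_costs(1024, 16).items():
        print(f"{name:12s} space ~ {cost['space']}, messages ~ {cost['messages']}")
```

Because constants are ignored, the sketch should not be used to rank the algorithms at a particular N. It only shows how each term grows, for example that the tree-based and centralized variants keep per-processor space constant while the grid-based variant trades O(N) space for O(√N) messages.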