Distributed multi-agent systems are usually large-scale, involving a large number of agents and messages. Existing checkpoint and recovery strategies are poorly suited to such systems because they incur either global recovery spread or runtime logging overhead. This paper presents our work on the design of correct and efficient checkpoint and recovery strategies for distributed agent systems. The first part of the paper introduces a formal model of recovery correctness that applies in general, covering the recoveries used by existing techniques: deterministic as well as non-deterministic, and single as well as simultaneous. In particular, the notions of atomic and quasi-atomic recovery blocks are introduced to capture the subset of events nullified in a single recovery. It is proved that multiple recoveries are correct if the recovery technique ensures that the corresponding recovery blocks are well-ordered. The rest of the paper exploits features of agent communication protocols to design a simple and efficient checkpoint protocol. Agents interact with each other via well-defined agent communication protocols; protocol sessions are group-based, and all message interactions are localized inside such groups. A group checkpoint strategy is proposed that uses this locality and well-structuredness to both reduce runtime overhead and minimize recovery spread. The resulting protocol creates strong, asynchronous group checkpoints, i.e., without any explicit kernel messages or runtime message logging. An accompanying recovery protocol uses a protocol dependency graph to identify the minimal quasi-atomic recovery block corresponding to single or simultaneous agent crashes. The correctness of the recovery protocol is proved under the formal model.
The paper concludes by discussing the significance of this research, contrasting it with related work, and outlining future research directions.
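To make the idea of a dependency-graph-driven recovery block concrete, the following is a minimal sketch and not the paper's actual protocol. It assumes a simplified model in which each protocol session is a group of agents, and a session that consumed the results of another session must be rolled back whenever that other session is. The names `recovery_block`, `members`, and `depends_on` are hypothetical and chosen only for illustration.

```python
from collections import deque


def recovery_block(members, depends_on, crashed):
    """Return the set of sessions to roll back after the given crashes.

    Assumed (hypothetical) model, not taken from the paper:
      members:    dict mapping session id -> set of participating agents
      depends_on: dict mapping session id -> set of sessions it depends on
      crashed:    iterable of crashed agent names (one or several)
    """
    crashed = set(crashed)

    # Invert the dependency edges: for each session, who depends on it?
    dependents = {}
    for session, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(session)

    # Sessions that directly involve a crashed agent seed the block.
    block = {s for s, agents in members.items() if agents & crashed}

    # Breadth-first closure over dependent sessions.
    queue = deque(block)
    while queue:
        current = queue.popleft()
        for nxt in dependents.get(current, ()):
            if nxt not in block:
                block.add(nxt)
                queue.append(nxt)
    return block


# Hypothetical example: s2 depends on s1; s3 is unrelated.
members = {"s1": {"a", "b"}, "s2": {"b", "c"}, "s3": {"d", "e"}}
depends_on = {"s2": {"s1"}}
```

Under these assumptions, a crash of agent `a` rolls back `s1` and its dependent `s2` while leaving `s3` untouched, which is the kind of localized recovery spread the abstract describes. Simultaneous crashes are handled by seeding the closure with every session that involves any crashed agent.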