Quasi-atomic recovery for distributed agents

Authors:
Hon F. Li;Zunce Wei;Dhrubajyoti Goswami
Affiliations:
Department of Computer Science, Concordia University, Montreal, Quebec, Canada H3G 1M8;Department of Computer Science, Concordia University, Montreal, Quebec, Canada H3G 1M8;Department of Computer Science, Concordia University, Montreal, Quebec, Canada H3G 1M8
Venue:
Parallel Computing
Year:
2006

Abstract

Distributed multi-agent systems are usually large-scale, involving many agents and messages. Existing checkpoint and recovery strategies suit such systems poorly because they incur either global recovery spread or runtime logging overhead. This paper presents the design of correct and efficient checkpoint and recovery strategies for distributed agent systems. The first part of the paper introduces a general formal model for the correctness of recovery, one that covers the kinds of recovery used by existing techniques: deterministic as well as non-deterministic, and single as well as simultaneous. In particular, the notions of atomic and quasi-atomic recovery blocks are introduced to capture the subset of events nullified in a single recovery. It is proved that multiple recoveries are correct if the recovery technique ensures a well-ordering of the corresponding recovery blocks. The rest of the paper exploits features of agent communication protocols to design a simple and efficient checkpoint protocol. Agents interact with each other via well-defined agent communication protocols; protocol sessions are group-based, and all message interactions are localized inside such groups. A group checkpoint strategy is proposed that uses this locality and well-structuredness to both reduce runtime overhead and minimize recovery spread. The resulting protocol creates strong and asynchronous group checkpoints, i.e., without any explicit kernel messages or runtime message logging. An accompanying recovery protocol uses a protocol dependency graph to identify the minimal quasi-atomic recovery block corresponding to single or simultaneous agent crashes. The correctness of the recovery protocol is proved under the formal model. The paper concludes by discussing the significance of this research, contrasting it with related work, and outlining future research directions.
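
The abstract does not define the protocol dependency graph, so the following is only a minimal illustrative sketch of the general idea of bounding recovery spread with a dependency graph. It assumes, hypothetically, that nodes are protocol sessions (groups), that an edge from one session to another means the second depends on the first, and that the set of sessions to roll back is everything reachable from the sessions of the crashed agents. The function name `recovery_block`, the mapping `dependents`, and the session labels are all invented for illustration and are not taken from the paper.

```python
from collections import deque


def recovery_block(dependents, crashed_sessions):
    """Return every session reachable from the crashed sessions.

    `dependents` maps a session to the sessions that depend on it.
    This structure is an assumption made for illustration; it is not
    the paper's definition of a protocol dependency graph.
    """
    block = set(crashed_sessions)
    queue = deque(crashed_sessions)
    while queue:
        session = queue.popleft()
        for nxt in dependents.get(session, ()):
            if nxt not in block:
                block.add(nxt)
                queue.append(nxt)
    return block


# Hypothetical sessions: s2 depends on s1, s3 on s2, s5 on s4.
dependents = {
    "s1": ["s2"],
    "s2": ["s3"],
    "s4": ["s5"],
}

# A single crash touches only the sessions downstream of s1.
print(sorted(recovery_block(dependents, ["s1"])))
# Simultaneous crashes are handled by seeding the search with both.
print(sorted(recovery_block(dependents, ["s1", "s4"])))
```

In this toy graph, a crash affecting only `s1` leaves `s4` and `s5` untouched, which is the kind of localized recovery spread that group-based sessions are meant to allow.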