Quasi-atomic recovery for distributed agents

  • Authors:
  • Hon F. Li; Zunce Wei; Dhrubajyoti Goswami

  • Affiliations:
  • Department of Computer Science, Concordia University, Montreal, Quebec, Canada H3G 1M8 (all three authors)

  • Venue:
  • Parallel Computing
  • Year:
  • 2006

Abstract

Distributed multi-agent systems are usually large-scale, involving many agents and messages. Existing checkpoint and recovery strategies are poorly suited to such systems because they incur either global recovery spread or runtime logging overhead. This paper presents our work on the design of correct and efficient checkpoint and recovery strategies for distributed agent systems. The first part of the paper introduces a formal model that captures the correctness of recovery in general, including the recoveries used by existing techniques, whether deterministic or non-deterministic, single or simultaneous. In particular, the notions of atomic and quasi-atomic recovery blocks are introduced to capture the subset of events nullified in a single recovery. It is proved that multiple recoveries are correct if a recovery technique ensures well-ordering of the corresponding recovery blocks. The rest of the paper exploits the features of agent communication protocols to design a simple and efficient checkpoint protocol. Specifically, agents interact with each other via well-defined agent communication protocols; agent protocol sessions are group-based, and all message interactions are localized inside such groups. A group checkpoint strategy is proposed that uses this locality and well-structuredness to reduce runtime overhead and minimize recovery spread. The resulting protocol creates strong, asynchronous group checkpoints, i.e., without any explicit kernel messages or runtime message logging. An accompanying recovery protocol uses a protocol dependency graph to identify the minimal quasi-atomic recovery block corresponding to single or simultaneous agent crashes. The correctness of the recovery protocol is proved under the formal model. The paper concludes with a discussion of the significance of our research and how it contrasts with related work, followed by future research directions.
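
The abstract does not spell out how the protocol dependency graph is built or traversed, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a hypothetical graph in which an edge from session A to session B means "B depends on A, so B must roll back if A does", and it computes the set of sessions affected by one or more crashes as a plain reachability closure. The function name `rollback_set` and the session labels are illustrative assumptions.

```python
from collections import deque


def rollback_set(dependents, crashed_sessions):
    """Return all sessions reachable from the crashed ones.

    `dependents` maps a session id to the sessions that depend on it
    (a hypothetical encoding; the paper's graph may differ).
    """
    seen = set(crashed_sessions)
    queue = deque(crashed_sessions)
    while queue:
        session = queue.popleft()
        for nxt in dependents.get(session, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical dependency graph over four protocol sessions.
graph = {"s1": ["s2"], "s2": ["s3"], "s4": []}

# A single crash touching session s2 pulls in s3 but not s1 or s4.
single = rollback_set(graph, ["s2"])

# Simultaneous crashes touching s1 and s4 are handled as one closure.
simultaneous = rollback_set(graph, ["s1", "s4"])
```

Starting the traversal from all crashed sessions at once treats single and simultaneous crashes uniformly, and sessions with no dependency path from a crash are left untouched, which is the sense in which recovery spread stays local in this sketch.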