In this paper, we present FRASystem (Fault Tolerant & Recovery Agent System), a fault-tolerance and recovery system for distributed computing systems that is built on multiple agents. Previous rollback-recovery protocols depended on inherent communication mechanisms and on the underlying operating system, which degraded computing performance. We propose a rollback-recovery protocol that works independently of the operating system, which improves portability and extensibility. We define four types of agents: (1) a recovery agent, which performs the rollback-recovery protocol after a failure; (2) an information agent, which builds domain knowledge, in the form of fault-tolerance rules and information, during failure-free operation; (3) a facilitator agent, which controls communication between agents; and (4) a garbage collection agent, which discards fault-tolerance information that is no longer useful. Since agent failures may lead to inconsistent system states and a domino effect, we also propose an agent recovery algorithm. A garbage collection protocol addresses the performance degradation caused by the growth of fault-tolerance information saved in stable storage. We implemented a prototype of FRASystem using Java and CORBA and ran experiments with the proposed rollback-recovery protocol. The simulation results indicate that our protocol performs better than previous rollback-recovery protocols that use independent checkpointing and pessimistic message logging without agents. Our contributions are as follows: (1) this is the first rollback-recovery protocol using agents, (2) FRASystem does not depend on an operating system, and (3) FRASystem provides portability and extensibility.
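To make the vocabulary of the abstract concrete, the following is a minimal sketch of independent checkpointing, pessimistic message logging, and log garbage collection, the techniques the abstract names when describing the protocols it compares against. It is not FRASystem's algorithm, which the abstract does not specify, and it is written in Python for brevity rather than in the Java and CORBA of the prototype. All names here (`StableStorage`, `Process`, `take_checkpoint`, `collect_garbage`, `recover`) are hypothetical, and the process is assumed to be deterministic with a running sum as its state.

```python
from dataclasses import dataclass, field


@dataclass
class StableStorage:
    """Hypothetical stand-in for stable storage that survives a process failure."""
    checkpoint: tuple = (0, 0)               # (messages delivered, saved state)
    log: list = field(default_factory=list)  # entries of (sequence number, message)


class Process:
    """Toy deterministic process: its state is the sum of the integers it receives."""

    def __init__(self, storage):
        self.storage = storage
        self.delivered = 0
        self.state = 0

    def receive(self, message):
        # Pessimistic logging: the message reaches stable storage
        # before the process is allowed to act on it.
        self.storage.log.append((self.delivered + 1, message))
        self._deliver(message)

    def _deliver(self, message):
        self.delivered += 1
        self.state += message

    def take_checkpoint(self):
        # Independent checkpointing: no coordination with other processes.
        self.storage.checkpoint = (self.delivered, self.state)

    def collect_garbage(self):
        # Log entries already covered by the checkpoint are never replayed.
        covered, _ = self.storage.checkpoint
        self.storage.log = [(seq, msg) for seq, msg in self.storage.log if seq > covered]

    @classmethod
    def recover(cls, storage):
        # Roll back to the last checkpoint, then replay logged messages in order.
        process = cls(storage)
        process.delivered, process.state = storage.checkpoint
        for seq, message in sorted(storage.log):
            if seq > process.delivered:
                process._deliver(message)
        return process


storage = StableStorage()
process = Process(storage)
for message in (5, 7):
    process.receive(message)
process.take_checkpoint()
process.collect_garbage()
process.receive(11)

# Simulated crash: the in-memory process is lost, stable storage remains.
recovered = Process.recover(storage)
print(recovered.state, recovered.delivered, storage.log)
```

In terms of the roles listed in the abstract, logging and checkpointing loosely correspond to what an information agent records during failure-free operation, `recover` to the recovery agent, and `collect_garbage` to the garbage collection agent. That mapping is an illustrative assumption, and the facilitator agent is not modeled.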