FRASystem: fault tolerant system using agents in distributed computing systems

Authors:
Hwamin Lee; Doosoon Park; Heonchang Yu; Giyeol Lee
Affiliations:
Division of Computer Science and Engineering, Soonchunhyang University, Asan-si, Korea 336-745; Division of Computer Science and Engineering, Soonchunhyang University, Asan-si, Korea 336-745; Dept. of Computer Science Education, Korea University, Seoul, Korea; Research and Development Center, Saman Corporation, Anyang, Korea 431-050
Venue:
Cluster Computing
Year:
2011



Abstract

In this paper, we present FRASystem (Fault Tolerant & Recovery Agent System), a fault tolerance and recovery system that uses multiple agents in distributed computing systems. Previous rollback-recovery protocols depended on the inherent communication mechanism and the underlying operating system, which degraded computing performance. We propose a rollback-recovery protocol that operates independently of the operating system, which improves portability and extensibility. We define four types of agents: (1) a recovery agent, which performs the rollback-recovery protocol after a failure; (2) an information agent, which builds domain knowledge, in the form of fault tolerance rules and information, during failure-free operation; (3) a facilitator agent, which controls communication between agents; and (4) a garbage collection agent, which garbage-collects fault tolerance information that is no longer useful. Since agent failures may lead to inconsistent system states and a domino effect, we also propose an agent recovery algorithm. A garbage collection protocol addresses the performance degradation caused by the growth of fault tolerance information saved in stable storage. We implemented a prototype of FRASystem using Java and CORBA and ran experiments on the proposed rollback-recovery protocol. The simulation results indicate that our protocol performs better than previous rollback-recovery protocols that use independent checkpointing and pessimistic message logging without agents. Our contributions are as follows: (1) this is the first rollback-recovery protocol using agents, (2) FRASystem does not depend on an operating system, and (3) FRASystem provides portability and extensibility.
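The abstract compares against protocols built on independent checkpointing and pessimistic message logging, and mentions garbage collection of fault tolerance information held in stable storage. The following is a minimal sketch of that general technique for a single process. It is not FRASystem's protocol, and every name in it (`StableStorage`, `Process`, `recover`, `collect_garbage`) is hypothetical and chosen only for illustration. Stable storage is assumed to survive crashes and is modelled here as an in-memory object.

```python
import copy


class StableStorage:
    """In-memory stand-in for stable storage (assumed to survive crashes)."""

    def __init__(self):
        self.checkpoint = None  # (state snapshot, messages delivered so far)
        self.log = []           # list of (sequence number, message)


class Process:
    """Hypothetical process whose state is a running total of messages."""

    def __init__(self, storage):
        self.storage = storage
        self.state = {"total": 0}
        self.delivered = 0

    def deliver(self, message):
        # Pessimistic logging: the message reaches stable storage
        # BEFORE it is allowed to change the process state.
        self.delivered += 1
        self.storage.log.append((self.delivered, message))
        self._apply(message)

    def _apply(self, message):
        self.state["total"] += message

    def take_checkpoint(self):
        # Independent checkpointing: no coordination with other processes.
        self.storage.checkpoint = (copy.deepcopy(self.state), self.delivered)

    def collect_garbage(self):
        # Log entries already covered by the latest checkpoint are never
        # replayed again, so they can be discarded.
        if self.storage.checkpoint is None:
            return 0
        _, covered = self.storage.checkpoint
        before = len(self.storage.log)
        self.storage.log = [
            (seq, msg) for seq, msg in self.storage.log if seq > covered
        ]
        return before - len(self.storage.log)


def recover(storage):
    """Rebuild a process from its last checkpoint plus the logged messages."""
    process = Process(storage)
    if storage.checkpoint is not None:
        snapshot, delivered = storage.checkpoint
        process.state = copy.deepcopy(snapshot)
        process.delivered = delivered
    # Replay, in order, only the messages delivered after the checkpoint.
    for seq, message in storage.log:
        if seq > process.delivered:
            process._apply(message)
            process.delivered = seq
    return process
```

In this sketch a crash is simulated by discarding the `Process` object and calling `recover(storage)`. Because each message is logged before it is applied, replaying the log on top of the checkpoint reproduces the pre-crash state without involving any other process, which is the property that distinguishes pessimistic logging from optimistic schemes.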