Failure Detection Service for Large Scale Systems

Authors:
Jacek Kobusiński
Affiliations:
Institute of Computing Science, Poznań University of Technology, Poland
Venue:
KES-AMSTA '07 Proceedings of the 1st KES International Symposium on Agent and Multi-Agent Systems: Technologies and Applications
Year:
2007

Abstract

This paper addresses the problem of building a failure detection service for large-scale distributed systems, including multi-agent systems. It describes the failure detector mechanism and defines the roles it plays in such a system. It then presents the key construction problems that arise when building a failure detection service. Finally, it sketches a general framework for implementing such a service. The proposed failure detection service can be used by mobile agents as a crucial component for building fault-tolerant multi-agent systems.
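
To make the notion of a failure detector concrete, the following is a minimal sketch of a timeout-based heartbeat detector. It is an illustration only and is not taken from the paper: the class name `HeartbeatFailureDetector`, its methods, the `timeout` parameter, and the agent identifiers are all hypothetical, and the paper's framework may work quite differently. Time is passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class HeartbeatFailureDetector:
    """Hypothetical timeout-based detector; not the paper's design."""

    timeout: float  # seconds of silence before a process is suspected
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, process_id: str, now: float) -> None:
        # Record the arrival time of the latest heartbeat from process_id.
        self.last_seen[process_id] = now

    def suspects(self, now: float) -> Set[str]:
        # A process is suspected when its last heartbeat is older than timeout.
        return {p for p, t in self.last_seen.items() if now - t > self.timeout}


fd = HeartbeatFailureDetector(timeout=3.0)
fd.heartbeat("agent-a", now=0.0)
fd.heartbeat("agent-b", now=0.0)
fd.heartbeat("agent-a", now=2.5)

suspected_at_4 = fd.suspects(now=4.0)  # agent-b has been silent for 4.0 s
fd.heartbeat("agent-b", now=5.0)       # a late heartbeat revokes the suspicion
suspected_at_5 = fd.suspects(now=5.0)

print(suspected_at_4)  # {'agent-b'}
print(suspected_at_5)  # set()
```

The detector only suspects processes and may later change its mind, which is why `agent-b` leaves the suspect set once its heartbeat arrives. A fixed timeout is the simplest possible choice here; it is an assumption of this sketch, not a recommendation from the paper.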