Failure Detectors as First Class Objects

Authors:
Pascal Felber;Xavier Défago;Rachid Guerraoui;Philipp Oser
Venue:
DOA '99 Proceedings of the International Symposium on Distributed Objects and Applications
Year:
1999

Citing 0
Cited 12

Detection of anomalies in software architecture with connectors
  Science of Computer Programming - Special issue on quality system and software architectures

FUSE: lightweight guaranteed distributed failure notification
  OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6

Grouping algorithms for scalable self-monitoring distributed systems
  Autonomics '08 Proceedings of the 2nd International Conference on Autonomic Computing and Communication Systems

Semantic partitioning of peer-to-peer search space
  Computer Communications

Design of the notification system for failure detectors
  International Journal of High Performance Computing and Networking

Comparative analysis of quality of service and memory usage for adaptive failure detectors in healthcare systems
  IEEE Journal on Selected Areas in Communications - Special issue on wireless and pervasive communications for healthcare

Fault-management in P2P-MPI
  International Journal of Parallel Programming

Fault management in P2P-MPI
  GPC'07 Proceedings of the 2nd international conference on Advances in grid and pervasive computing

A security management scheme for failure detector distributed systems based on self-tuning control theory
  Journal of Intelligent Manufacturing

A case for event-driven distributed objects
  ODBASE'06/OTM'06 Proceedings of the 2006 Confederated international conference on On the Move to Meaningful Internet Systems: CoopIS, DOA, GADA, and ODBASE - Volume Part II

An architectural framework for detecting process hangs/crashes
  EDCC'05 Proceedings of the 5th European conference on Dependable Computing

Survey: Survey of fault tolerant techniques for grid
  Computer Science Review

Abstract

One of the fundamental differences between a centralized system and a distributed one is the notion of partial failures. The ability to efficiently and accurately detect failures is a key element underlying reliable distributed computing. In current distributed systems, however, failure detection is either left to the application developer or hidden from the programmer and provided in an ad hoc manner behind the scenes. We plead for an intermediate approach where failure detectors are first class objects. We view failure detection as an abstraction, the complexity of which is encapsulated behind well-defined interfaces. The various roles of a failure detection service are all represented as first class objects. Following our approach, one can reuse existing failure detection protocols as they are or, through composition or refinement, define new protocols that match the application requirements. We describe an interesting result of a composition that mixes push and pull failure monitoring, and we show how scalability issues may be addressed by using a hierarchical failure detection configuration. We also discuss the implementation of our failure service both in CORBA and in Java.
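
The idea of composing push and pull monitoring behind a common interface can be illustrated with a minimal Java sketch. It is not the authors' service: every name in it (Monitorable, FailureDetector, PushDetector, PullDetector, DualDetector) is hypothetical, the clock value is passed in explicitly to keep the example deterministic, and the composition rule shown (ping only the objects whose heartbeats are missing) is just one plausible way to mix the two styles.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical role: an object that answers liveness requests (pull style). */
interface Monitorable {
    boolean isAlive();
}

/** Hypothetical role: a failure detector, queried with an explicit clock value. */
interface FailureDetector {
    boolean isSuspected(String id, long nowMillis);
}

/** Push style: monitored objects send heartbeats; silence beyond the timeout means suspicion. */
final class PushDetector implements FailureDetector {
    private final long timeoutMillis;
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    PushDetector(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Records that the object named id reported in at time nowMillis. */
    void heartbeat(String id, long nowMillis) {
        lastHeartbeat.put(id, nowMillis);
    }

    @Override
    public boolean isSuspected(String id, long nowMillis) {
        Long last = lastHeartbeat.get(id);
        // Objects that never sent a heartbeat are suspected as well.
        return last == null || nowMillis - last > timeoutMillis;
    }
}

/** Pull style: the detector asks the monitored object directly. */
final class PullDetector implements FailureDetector {
    private final Map<String, Monitorable> targets = new HashMap<>();

    void register(String id, Monitorable target) {
        targets.put(id, target);
    }

    @Override
    public boolean isSuspected(String id, long nowMillis) {
        Monitorable target = targets.get(id);
        // Unknown objects, and objects that fail the liveness request, are suspected.
        return target == null || !target.isAlive();
    }
}

/** Composition (an assumption of this sketch): trust heartbeats first, ping only on silence. */
final class DualDetector implements FailureDetector {
    private final FailureDetector push;
    private final FailureDetector pull;

    DualDetector(FailureDetector push, FailureDetector pull) {
        this.push = push;
        this.pull = pull;
    }

    @Override
    public boolean isSuspected(String id, long nowMillis) {
        // Short-circuit: the pull detector is consulted only when push already suspects.
        return push.isSuspected(id, nowMillis) && pull.isSuspected(id, nowMillis);
    }
}
```

Because DualDetector depends only on the FailureDetector interface, either side could be replaced by another detector, including another composite. A hierarchical configuration could be assembled the same way, although this sketch does not attempt one.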