A Failure Detection System for Large Scale Distributed Systems

Authors:
Valentin Cristea; Andrei Lavinia; Ciprian Dobre; Florin Pop
Affiliations:
University Politehnica of Bucharest, Romania (all authors)
Venue:
International Journal of Distributed Systems and Technologies
Year:
2011

Abstract

Failure detection is a fundamental building block for ensuring fault tolerance in large-scale distributed systems. It is also a difficult problem: resources under heavy load can be mistaken for failed ones, and a missing response may indicate either a failed network link or a failed computational resource. Although progress has been made, no existing approach provides a system that covers all essential aspects of a distributed environment. This paper presents a failure detection system based on adaptive, decentralized failure detectors. The system is developed as an independent substrate that works asynchronously and independently of the application flow. It uses a hierarchical protocol that creates a clustering mechanism, ensuring dynamic configuration and traffic optimization. It also uses a gossip strategy for failure detection at the local level to minimize detection time and remove wrong suspicions. Results show that the system scales with the number of monitored resources while still taking into account the QoS requirements of both applications and resources.
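The abstract does not describe how the adaptive detectors are implemented. As a rough illustration of the general idea only, the sketch below shows one common way a heartbeat-based detector can adapt its timeout to the observed arrival rate, so that a slow but live resource is less likely to be suspected. The class name `AdaptiveDetector`, the `window` and `margin` parameters, and the mean-interval estimate are all assumptions made for this example and are not taken from the paper.

```python
from collections import deque


class AdaptiveDetector:
    """Hypothetical sketch: suspect a peer when its next heartbeat is
    later than a deadline estimated from recent arrival times."""

    def __init__(self, window=5, margin=0.5):
        # Most recent heartbeat arrival times (assumed sliding window).
        self.arrivals = deque(maxlen=window)
        # Safety margin in seconds added to the estimate (assumed value).
        self.margin = margin

    def heartbeat(self, now):
        """Record the arrival time of a heartbeat."""
        self.arrivals.append(now)

    def deadline(self):
        """Return the time after which the peer becomes suspected,
        or None while there is too little history to estimate it."""
        if len(self.arrivals) < 2:
            return None
        span = self.arrivals[-1] - self.arrivals[0]
        mean_interval = span / (len(self.arrivals) - 1)
        return self.arrivals[-1] + mean_interval + self.margin

    def suspects(self, now):
        """True if the current time is past the adaptive deadline."""
        d = self.deadline()
        return d is not None and now > d


# Usage: a peer that sends heartbeats every 2 seconds gets a longer
# deadline than one that sends them every second.
fast = AdaptiveDetector()
slow = AdaptiveDetector()
for t in (0.0, 1.0, 2.0, 3.0):
    fast.heartbeat(t)
for t in (0.0, 2.0, 4.0, 6.0):
    slow.heartbeat(t)
```

Because the deadline follows the measured interval rather than a fixed constant, a heavily loaded peer whose heartbeats slow down gradually shifts its own deadline outward. A real system would likely combine such an estimate with the gossip and hierarchy mechanisms the abstract mentions, which this sketch does not attempt to model.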