We study the quality of service (QoS) of failure detectors. By QoS, we mean a specification that quantifies (a) how fast the failure detector detects actual failures, and (b) how well it avoids false detections. We first propose a set of QoS metrics to specify failure detectors for systems with probabilistic behaviors, i.e., for systems where message delays and message losses follow some probability distributions. We then give a new failure detector algorithm and analyze its QoS in terms of the proposed metrics. We show that, among a large class of failure detectors, the new algorithm is optimal with respect to some of these QoS metrics. Given a set of failure detector QoS requirements, we show how to compute the parameters of our algorithm so that it satisfies these requirements, and we show how this can be done even if the probabilistic behavior of the system is not known. Finally, we briefly explain how to make our failure detector adaptive, so that it automatically reconfigures itself when there is a change in the probabilistic behavior of the network.
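To make the two sides of the QoS trade-off concrete, the following is a minimal simulation sketch, not the algorithm as specified in the paper. It assumes a simple heartbeat scheme: the monitored process sends heartbeat i at time i*eta, and the monitor trusts it during the window starting at i*eta + delta only once some heartbeat numbered i or higher has arrived. The names `simulate_heartbeat_detector`, `eta`, `delta`, `loss_prob`, and `mean_delay` are illustrative, and the exponential delay model is an assumption chosen only for convenience.

```python
import random


def simulate_heartbeat_detector(eta, delta, n_heartbeats, loss_prob, mean_delay, seed=0):
    """Estimate two QoS figures for a hypothetical heartbeat failure detector.

    Assumed scheme: heartbeat i is sent at time i * eta. In the window
    [i * eta + delta, (i + 1) * eta + delta) the monitor trusts the sender
    only once a heartbeat numbered i or higher has arrived.
    The sender never crashes here, so every suspicion is a false one.
    """
    rng = random.Random(seed)
    inf = float("inf")

    # Arrival time of each heartbeat; lost heartbeats never arrive.
    arrival = {}
    for i in range(1, n_heartbeats + 1):
        if rng.random() < loss_prob:
            arrival[i] = inf
        else:
            # Assumed delay model: exponential with the given mean.
            arrival[i] = i * eta + rng.expovariate(1.0 / mean_delay)

    # earliest[i] = earliest arrival among heartbeats i, i+1, ..., n.
    earliest = {}
    best = inf
    for i in range(n_heartbeats, 0, -1):
        best = min(best, arrival[i])
        earliest[i] = best

    # Accumulate the time spent wrongly suspecting the (live) sender.
    suspect_time = 0.0
    for i in range(1, n_heartbeats):
        start = i * eta + delta
        end = start + eta
        trusted_from = min(max(earliest[i], start), end)
        suspect_time += trusted_from - start

    total_time = (n_heartbeats - 1) * eta
    return {
        # How well false detections are avoided (lower is better).
        "wrong_suspicion_fraction": suspect_time / total_time,
        # How fast a real crash is detected, in the worst case, under this scheme.
        "max_detection_time": eta + delta,
    }


if __name__ == "__main__":
    for delta in (0.05, 0.2, 1.0):
        print(delta, simulate_heartbeat_detector(1.0, delta, 10_000, 0.01, 0.1))
```

In this sketch, increasing `delta` can only lower the fraction of time spent in false suspicion, while it raises the worst-case detection time `eta + delta`; choosing such parameters to meet given requirements is the kind of configuration problem the abstract describes.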