Editor's Note: This paper unfortunately contains some errors which led to the paper being reprinted in the May 2002 issue. Please see IEEE Transactions on Computers, vol. 51, no. 5, May 2002, pp. 561-580 for the correct paper (available without subscription).

We study the quality of service (QoS) of failure detectors. By QoS, we mean a specification that quantifies 1) how fast the failure detector detects actual failures and 2) how well it avoids false detections. We first propose a set of QoS metrics to specify failure detectors for systems with probabilistic behaviors, i.e., for systems where message delays and message losses follow some probability distributions. We then give a new failure detector algorithm and analyze its QoS in terms of the proposed metrics. We show that, among a large class of failure detectors, the new algorithm is optimal with respect to some of these QoS metrics. Given a set of failure detector QoS requirements, we show how to compute the parameters of our algorithm so that it satisfies these requirements, and we show how this can be done even if the probabilistic behavior of the system is not known. We then present some simulation results that show that the new failure detector algorithm provides a better QoS than an algorithm that is commonly used in practice. Finally, we suggest some ways to make our failure detector adaptive to changes in the probabilistic behavior of the network.
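The two QoS aspects named above (speed of detection and avoidance of false detections) can be illustrated with a minimal sketch. It models a plain fixed-timeout heartbeat detector, which is not the algorithm proposed in the paper. The function names, the exponential delay model, and the convention that the monitor starts out trusting at time 0 are all assumptions made for illustration.

```python
import random


def heartbeat_arrivals(eta, loss_prob, mean_delay, crash_time, seed=0):
    """Hypothetical helper: arrival times at q of heartbeats sent by p.

    p sends heartbeat i at time i * eta until it crashes at crash_time.
    Each heartbeat is lost with probability loss_prob; otherwise it is
    delayed by an exponentially distributed time (an assumed model).
    """
    rng = random.Random(seed)
    arrivals = []
    i = 1
    while i * eta <= crash_time:
        if rng.random() >= loss_prob:
            arrivals.append(i * eta + rng.expovariate(1.0 / mean_delay))
        i += 1
    return sorted(arrivals)


def qos_metrics(arrivals, timeout, crash_time):
    """Return (detection_time, mistakes) for a fixed-timeout detector.

    q suspects p whenever more than `timeout` time units have passed
    since the last heartbeat arrival (q starts out trusting at time 0).
    `arrivals` must be sorted in increasing order.

    detection_time: time from the crash until q suspects p permanently.
    mistakes: number of suspicion intervals that begin while p is alive.
    """
    last = 0.0
    mistakes = 0
    for a in arrivals:
        # A gap longer than the timeout means q suspected p from
        # last + timeout until the heartbeat arriving at time a.
        if a - last > timeout and last + timeout < crash_time:
            mistakes += 1
        last = a
    permanent = last + timeout  # no heartbeat arrives after `last`
    if permanent < crash_time:
        mistakes += 1  # the final suspicion began before the crash
    detection_time = max(0.0, permanent - crash_time)
    return detection_time, mistakes


# Hand-picked arrival times: a long gap between t=2 and t=5, crash at t=6.5.
arrivals = [1.0, 2.0, 5.0, 6.0]
print(qos_metrics(arrivals, timeout=1.5, crash_time=6.5))  # (1.0, 1)
print(qos_metrics(arrivals, timeout=3.5, crash_time=6.5))  # (3.0, 0)
```

In this toy run, the short timeout detects the crash after 1.0 time unit but makes one false suspicion, while the longer timeout makes none and needs 3.0 time units. That is the trade-off a QoS specification has to pin down.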