The Fault Detection Problem

Authors:
Andreas Haeberlen;Petr Kuznetsov
Affiliations:
Max Planck Institute for Software Systems (MPI-SWS);Deutsche Telekom Laboratories, TU Berlin
Venue:
OPODIS '09 Proceedings of the 13th International Conference on Principles of Distributed Systems
Year:
2009

References (14):
On the minimal synchronism needed for distributed consensus. Journal of the ACM (JACM)
An Intrusion-Detection Model. IEEE Transactions on Software Engineering - Special issue on computer security and privacy
Asynchronous byzantine agreement protocols. Information and Computation
Shifting gears: changing algorithms on the fly to expedite Byzantine agreement. PODC '87 Proceedings of the sixth annual ACM Symposium on Principles of distributed computing
Knowledge and common knowledge in a distributed environment. Journal of the ACM (JACM)
Unreliable failure detectors for reliable distributed systems. Journal of the ACM (JACM)
The weakest failure detector for solving consensus. Journal of the ACM (JACM)
The part-time parliament. ACM Transactions on Computer Systems (TOCS)
The Byzantine Generals Problem. ACM Transactions on Programming Languages and Systems (TOPLAS)
Fault Detection for Byzantine Quorum Systems. IEEE Transactions on Parallel and Distributed Systems
Practical byzantine fault tolerance and proactive recovery. ACM Transactions on Computer Systems (TOCS)
Secure untrusted data repository (SUNDR). OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Tolerating byzantine faults in transaction processing systems using commit barrier scheduling. Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
PeerReview: practical accountability for distributed systems. Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles

Cited by (2):
Secure network provenance. SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Towards privacy-preserving fault detection. Proceedings of the 9th Workshop on Hot Topics in Dependable Systems

Abstract

One of the most important challenges in distributed computing is ensuring that services are correct and available despite faults. Recently it has been argued that fault detection can be factored out from computation, and that a generic fault detection service can be a useful abstraction for building distributed systems. However, while fault detection has been extensively studied for crash faults, little is known about detecting more general kinds of faults. This paper explores the power and the inherent costs of generic fault detection in a distributed system. We propose a formal framework that allows us to partition the set of all faults that can possibly occur in a distributed computation into several fault classes. Then we formulate the fault detection problem for a given fault class, and we show that this problem can be solved for only two specific fault classes, namely omission faults and commission faults. Finally, we derive tight lower bounds on the cost of solving the problem for these two classes in asynchronous message-passing systems.