In today's world where distributed systems form many of our critical infrastructures, dependability outages are becoming increasingly common. In many situations, it is necessary to not just detect a failure, but also to diagnose the failure, i.e., to identify the source of the failure. Diagnosis is challenging since high-throughput applications with frequent interactions between the different components allow fast error propagation. It is desirable to consider applications as black boxes for the diagnostic process. In this paper, we propose a Monitor architecture for diagnosing failures in large-scale network protocols. The Monitor only observes the message exchanges between the protocol entities (PEs) remotely and does not access internal protocol state. At runtime, it builds a causal graph between the PEs based on their communication and uses this together with a rule base of allowed state transition paths to diagnose the failure. The tests used for the diagnosis are based on the rule base and are assumed to have imperfect coverage. The hierarchical Monitor framework allows distributed diagnosis handling failures at individual Monitors. The framework is implemented and applied to a reliable multicast protocol executing on our campus-wide network. Fault injection experiments are carried out to evaluate the accuracy and latency of the diagnosis.
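The diagnosis idea described above (a causal graph built from observed messages, combined with a rule base of allowed state transitions) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the message tuple layout, the toy rule base `RULES`, the PE names, and the functions `build_causal_graph`, `suspects`, `violates_rules`, and `diagnose` are all hypothetical, and the rule check is deterministic, whereas the paper assumes tests with imperfect coverage.

```python
from collections import defaultdict

# Hypothetical rule base (not from the paper): allowed transitions for the
# messages a PE sends, as (state, message type) -> next state.
RULES = {
    ("init", "JOIN"): "joined",
    ("joined", "DATA"): "joined",
    ("joined", "LEAVE"): "init",
}

def build_causal_graph(messages):
    """Map each receiver to the (time, sender) pairs of messages it received."""
    graph = defaultdict(list)
    for time, sender, receiver, _msg_type in messages:
        graph[receiver].append((time, sender))
    return graph

def suspects(graph, failed_pe, failure_time):
    """Walk the graph backwards to find PEs that could have influenced the failure."""
    latest = {failed_pe: failure_time}  # latest relevant time seen per PE
    stack = [(failed_pe, failure_time)]
    while stack:
        pe, before = stack.pop()
        for time, sender in graph.get(pe, ()):
            # Only messages sent before the point of interest can have caused it.
            if time < before and time > latest.get(sender, float("-inf")):
                latest[sender] = time
                stack.append((sender, time))
    return set(latest)

def violates_rules(messages, pe, failure_time):
    """Replay the messages sent by `pe` and report any disallowed transition."""
    state = "init"
    for time, sender, _receiver, msg_type in sorted(messages):
        if sender != pe or time >= failure_time:
            continue
        if (state, msg_type) not in RULES:
            return True
        state = RULES[(state, msg_type)]
    return False

def diagnose(messages, failed_pe, failure_time):
    """Return the suspect PEs whose observed behaviour breaks the rule base."""
    graph = build_causal_graph(messages)
    return sorted(pe for pe in suspects(graph, failed_pe, failure_time)
                  if violates_rules(messages, pe, failure_time))

# Toy trace: (time, sender, receiver, message type). C sends DATA without JOIN.
messages = [
    (1, "A", "B", "JOIN"),
    (2, "A", "B", "DATA"),
    (3, "C", "B", "DATA"),
    (4, "B", "D", "JOIN"),
    (5, "B", "D", "DATA"),
]
print(diagnose(messages, "D", 6))  # prints ['C']
```

In this toy trace the failure is detected at D, but the backward walk reaches C through B, and only C's sent messages break the rule base. That mirrors the point in the abstract that the PE where a failure shows up need not be its source once errors propagate.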