Automated Rule-Based Diagnosis through a Distributed Monitor System

Authors:
Gunjan Khanna;Mike Yu Cheng;Padma Varadharajan;Saurabh Bagchi;Miguel P. Correia;Paulo J. Veríssimo
Venue:
IEEE Transactions on Dependable and Secure Computing
Year:
2007



Abstract

In today's world where distributed systems form many of our critical infrastructures, dependability outages are becoming increasingly common. In many situations, it is necessary to not just detect a failure, but also to diagnose the failure, i.e., to identify the source of the failure. Diagnosis is challenging since high-throughput applications with frequent interactions between the different components allow fast error propagation. It is desirable to consider applications as black boxes for the diagnostic process. In this paper, we propose a Monitor architecture for diagnosing failures in large-scale network protocols. The Monitor only observes the message exchanges between the protocol entities (PEs) remotely and does not access internal protocol state. At runtime, it builds a causal graph between the PEs based on their communication and uses this together with a rule base of allowed state transition paths to diagnose the failure. The tests used for the diagnosis are based on the rule base and are assumed to have imperfect coverage. The hierarchical Monitor framework allows distributed diagnosis handling failures at individual Monitors. The framework is implemented and applied to a reliable multicast protocol executing on our campus-wide network. Fault injection experiments are carried out to evaluate the accuracy and latency of the diagnosis.
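
The idea of combining a causal graph with a rule base can be illustrated with a minimal sketch. This is not the paper's implementation: the `Monitor` class, the `RULES` table, the state names, and the message types (`JOIN`, `DATA`, `LEAVE`) are all hypothetical, chosen only to show how a black-box observer might record who sent messages to whom, check each observed message against allowed state transitions, and then restrict the suspects for a failure to PEs that causally precede the failed one.

```python
from collections import defaultdict

# Hypothetical rule base: (inferred sender state, message type) -> next state.
# Any observed message with no matching entry counts as a rule violation.
RULES = {
    ("idle", "JOIN"): "member",
    ("member", "DATA"): "member",
    ("member", "LEAVE"): "idle",
}


class Monitor:
    """Black-box observer: sees only (sender, receiver, message type)."""

    def __init__(self, rules, start_state="idle"):
        self.rules = rules
        self.start_state = start_state
        self.state = {}                     # PE -> state inferred from messages
        self.violations = defaultdict(int)  # PE -> number of rule violations
        self.parents = defaultdict(set)     # receiver -> senders (causal edges)

    def observe(self, sender, receiver, msg_type):
        # Record the causal edge sender -> receiver.
        self.parents[receiver].add(sender)
        # Check the message against the allowed transitions for the sender.
        current = self.state.get(sender, self.start_state)
        next_state = self.rules.get((current, msg_type))
        if next_state is None:
            self.violations[sender] += 1
        else:
            self.state[sender] = next_state

    def diagnose(self, failed_pe):
        # Walk the causal graph backwards from the PE where the failure
        # was detected, then report reachable PEs that violated a rule.
        seen, stack = set(), [failed_pe]
        while stack:
            pe = stack.pop()
            if pe in seen:
                continue
            seen.add(pe)
            stack.extend(self.parents[pe])
        return sorted(pe for pe in seen if self.violations[pe] > 0)


monitor = Monitor(RULES)
monitor.observe("A", "B", "JOIN")   # allowed: A moves idle -> member
monitor.observe("A", "B", "DATA")   # allowed: A stays member
monitor.observe("C", "B", "DATA")   # violation: C sends DATA without JOIN
monitor.observe("D", "E", "DATA")   # violation, but causally unrelated to B
suspects = monitor.diagnose("B")
```

In this toy run, `suspects` contains only `C`: `D` also broke a rule, but it never communicated with `B`, so the causal graph excludes it. A real deployment would also need to account for imperfect test coverage and for multiple cooperating Monitors, both of which this sketch ignores.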