Automated Online Monitoring of Distributed Applications through External Monitors

Authors:
Gunjan Khanna;Padma Varadharajan;Saurabh Bagchi
Affiliations:
-;-;IEEE
Venue:
IEEE Transactions on Dependable and Secure Computing
Year:
2006

Citing 25
Cited 7

Modeling and Verification of Time Dependent Systems Using Time Petri Nets

IEEE Transactions on Software Engineering
Fault detection with multiple observers

IEEE/ACM Transactions on Networking (TON)
The temporal logic of actions

ACM Transactions on Programming Languages and Systems (TOPLAS)
Observer-A Concept for Formal On-Line Validation of Distributed Systems

IEEE Transactions on Software Engineering
Schemes for fault identification in communication networks

IEEE/ACM Transactions on Networking (TON)
A unified approach to fault-tolerance in communication protocols based on recovery procedures

IEEE/ACM Transactions on Networking (TON)
Automated packet trace analysis of TCP implementations

SIGCOMM '97 Proceedings of the ACM SIGCOMM '97 conference on Applications, technologies, architectures, and protocols for computer communication
Specification and verification of fault-tolerance, timing, and scheduling

ACM Transactions on Programming Languages and Systems (TOPLAS)
What packets may come: automata for network monitoring

POPL '01 Proceedings of the 28th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Partial-Order Reduction in Symbolic State-Space Exploration

Formal Methods in System Design - Special issue on CAV '97
Symbolic Model Checking

Symbolic Model Checking
Detection of Summative Global Predicates

ICPADS '97 Proceedings of the 1997 International Conference on Parallel and Distributed Systems
From Crash Fault-Tolerance to Arbitrary-Fault Tolerance: Towards a Modular Approach

DSN '00 Proceedings of the 2000 International Conference on Dependable Systems and Networks (formerly FTCS-30 and DCCA-8)
On the Quality of Service of Failure Detectors

DSN '00 Proceedings of the 2000 International Conference on Dependable Systems and Networks (formerly FTCS-30 and DCCA-8)
A Framework for Database Audit and Control Flow Checking for a Wireless Telephone Network Controller

DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
Pinpoint: Problem Determination in Large, Dynamic Internet Services

DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
A Compositional Approach to Monitoring Distributed Systems

DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
Basic notions of trace theory

Linear Time, Branching Time and Partial Order in Logics and Models for Concurrency, School/Workshop
Astrolabe: A robust and scalable technology for distributed system monitoring, management, and data mining

ACM Transactions on Computer Systems (TOCS)
Automatic alarm correlation for fault identification

INFOCOM '95 Proceedings of the Fourteenth Annual Joint Conference of the IEEE Computer and Communication Societies (Vol. 2)
Deadlock Detection in Communicating Finite State Machines by Even Reachability Analysis

ICCCN '95 Proceedings of the 4th International Conference on Computer Communications and Networks
Performance debugging for distributed systems of black boxes

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
TRAM: A Tree-based Reliable Multicast Protocol

TRAM: A Tree-based Reliable Multicast Protocol
Failure Handling in a Reliable Multicast Protocol for Improving Buffer Utilization and Accommodating Heterogeneous Receivers

PRDC '04 Proceedings of the 10th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC'04)
Self Checking Network Protocols: A Monitor Based Approach

SRDS '04 Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems

How to keep your head above water while detecting errors

Proceedings of the 10th ACM/IFIP/USENIX International Conference on Middleware
How to keep your head above water while detecting errors

Middleware'09 Proceedings of the ACM/IFIP/USENIX 10th international conference on Middleware
Error detection framework for complex software systems

EWDC '11 Proceedings of the 13th European Workshop on Dependable Computing
Constructing formal rules to verify message communication in distributed systems

The Journal of Supercomputing
A proposal to detect errors in Enterprise Application Integration solutions

Journal of Systems and Software
A decentralized approach for mining event correlations in distributed system monitoring

Journal of Parallel and Distributed Computing
Specification and verification of reliability in dispatching multicast messages

The Journal of Supercomputing

Abstract

Providing error detection for large-scale distributed systems is challenging when they run legacy code on hosts that may not allow fault-tolerant functions to execute on them. A natural approach is to place detection in an observer system kept separate from the observed system of protocol entities, with the observer having access only to the observed system's external message exchanges. In this paper, we propose an autonomous, self-checking Monitor system that provides fast error detection for the underlying network protocols. The Monitor architecture is application-neutral and therefore lends itself to deployment for different protocols; only the rulebase, against which the observed interactions are matched, is specific to a protocol. To make the detection infrastructure scalable and dependable, we extend it to a hierarchical Monitor structure. The Monitor structure is made dynamic and reconfigurable by designing different interactions to cope with failures, load changes, or mobility. The latency of the Monitor system is evaluated under fault-free conditions, while its coverage is evaluated under simulated error injections.
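
The idea of matching externally observed messages against a protocol-specific rulebase can be illustrated with a minimal sketch. This is not the paper's rule language or implementation, which the abstract does not describe; the `Message`, `Rule`, and `Monitor` names, the "trigger must be answered within a deadline" rule shape, and the `DATA`/`ACK` message types are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    """An externally observed message; the monitor sees nothing else."""
    time: float      # observation timestamp in seconds
    sender: str
    receiver: str
    kind: str        # hypothetical message type, e.g. "DATA" or "ACK"


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: each `trigger` must be answered by `expected`
    within `deadline` seconds."""
    trigger: str
    expected: str
    deadline: float


@dataclass
class Monitor:
    rules: list
    pending: list = field(default_factory=list)     # (due_time, rule, trigger message)
    violations: list = field(default_factory=list)  # (rule, trigger message)

    def observe(self, msg):
        # Flag obligations whose deadline passed before this message arrived.
        self._expire(msg.time)
        # Discharge the oldest obligation that this message answers.
        for entry in self.pending:
            _, rule, trig = entry
            if (msg.kind == rule.expected
                    and msg.sender == trig.receiver
                    and msg.receiver == trig.sender):
                self.pending.remove(entry)
                break
        # Open new obligations triggered by this message.
        for rule in self.rules:
            if msg.kind == rule.trigger:
                self.pending.append((msg.time + rule.deadline, rule, msg))

    def _expire(self, now):
        still_pending = []
        for due, rule, trig in self.pending:
            if due < now:
                self.violations.append((rule, trig))
            else:
                still_pending.append((due, rule, trig))
        self.pending = still_pending

    def flush(self, now):
        """Report obligations that are overdue at time `now` (end of trace)."""
        self._expire(now)


# Usage: the rulebase is the only protocol-specific part.
monitor = Monitor(rules=[Rule("DATA", "ACK", deadline=1.0)])
monitor.observe(Message(0.0, "A", "B", "DATA"))
monitor.observe(Message(0.4, "B", "A", "ACK"))   # answered in time
monitor.observe(Message(2.0, "A", "B", "DATA"))  # never acknowledged
monitor.flush(now=5.0)
print(len(monitor.violations))
```

In this sketch the matching engine knows nothing about the protocol; swapping the list of `Rule` objects is what would retarget it, which mirrors the application-neutral split the abstract describes.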