Architecture-driven diagnosis of performance failures in a token ring

Authors:
Andrew Williams;Priya Narasimhan
Affiliations:
Electrical & Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA;Electrical & Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA
Venue:
HotDep'07 Proceedings of the 3rd workshop on Hot Topics in System Dependability
Year:
2007

Abstract

Communication infrastructures that provide distributed systems with key services can also become the medium through which faults propagate. We have previously observed that a single faulty node can degrade the performance of other, non-faulty nodes in the system. We present a method for identifying the node where a failure originates by examining the network flows of a distributed system, which are constrained by its architecture. By combining the observed effects of the failure on the network with our knowledge of the network-flow constraints, we can trace those effects back to the source node. We empirically evaluate our method on a data set generated by injecting multiple performance faults into a replicated middleware system built on a token-ring-based group communication protocol. We correctly identify the faulty node for failures that significantly change the performance characteristics of the network.
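
The idea of tracing a slowdown back through a ring's flow constraints can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not spell out; it is a hypothetical heuristic that assumes we can log the time at which each node receives the token. Because the token must pass from each node to its successor, a slow node lengthens the rotation time seen by every node, but only its own hold time (the gap until its successor receives the token) stands out. The function names, the trace format, and the threshold `factor` are all assumptions made for illustration.

```python
from statistics import median

def token_hold_times(arrivals):
    """Group per-node hold times from a token trace.

    arrivals: list of (node_id, timestamp) pairs, in the order the token
    was received around the ring (hypothetical trace format).
    A node's hold time is the gap until the next node receives the token.
    """
    holds = {}
    for (node, t), (_, t_next) in zip(arrivals, arrivals[1:]):
        holds.setdefault(node, []).append(t_next - t)
    return holds

def suspect_node(arrivals, factor=3.0):
    """Return the node whose median hold time exceeds `factor` times the
    ring-wide median, or None if no node stands out.

    `factor` is an arbitrary illustrative threshold, not a value from the paper.
    """
    holds = token_hold_times(arrivals)
    per_node = {node: median(samples) for node, samples in holds.items()}
    baseline = median(per_node.values())
    worst = max(per_node, key=per_node.get)
    return worst if per_node[worst] > factor * baseline else None
```

For example, in a synthetic four-node trace where one node holds the token ten times longer than its peers, `suspect_node` returns that node's identifier; in a trace where all hold times are equal it returns `None`. A real deployment would need to account for clock skew, message loss, and faults that do not show up as longer hold times.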