Architecture-driven diagnosis of performance failures in a token ring

  • Authors:
  • Andrew Williams; Priya Narasimhan

  • Affiliations:
  • Electrical & Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA (both authors)

  • Venue:
  • HotDep'07: Proceedings of the 3rd Workshop on Hot Topics in System Dependability
  • Year:
  • 2007

Abstract

Communication infrastructures that provide distributed systems with key services can also become the medium through which faults spread across the system. We have previously observed that a single faulty node can degrade the performance of other, non-faulty nodes in the system. We present a method for identifying the node at which a failure originates by examining the architecture-constrained network flows of a distributed system. By identifying the effects of the failure on the network and combining them with our knowledge of the network-flow constraints, we can trace those effects back to the source node. We empirically evaluate our method on a data set generated by injecting multiple performance faults into a replicated middleware system built on a token-ring-based group communication protocol. We correctly identify the faulty node for failures that significantly change the performance characteristics of the network.
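The abstract only outlines the method, so the following is a hypothetical illustration of the general idea, not the authors' algorithm. It assumes that timestamped token passes (sender, receiver, time) can be observed on the network and that the ring order is known. A slow node lengthens the token rotation time seen by every node, so that symptom alone does not localize the fault. However, the ring's architectural constraint (each node may forward the token only to its successor) lets us split each rotation into per-node hold times and trace the delay back to its origin. The functions `hold_times`, `rotation_times`, `diagnose`, and `simulate`, as well as the `factor` threshold, are invented for this sketch.

```python
from collections import defaultdict
from statistics import median


def hold_times(ring, passes):
    """Per-node token hold durations, derived from observed passes.

    ring:   node ids in token order (assumed known from the architecture)
    passes: (sender, receiver, time) tuples sorted by time (hypothetical trace format)
    """
    succ = {node: ring[(i + 1) % len(ring)] for i, node in enumerate(ring)}
    arrived, holds = {}, defaultdict(list)
    for sender, receiver, t in passes:
        # Architectural constraint: the token may only move to the sender's successor.
        if succ[sender] != receiver:
            raise ValueError(f"pass {sender}->{receiver} violates ring order")
        if sender in arrived:
            # Time between receiving the token and forwarding it.
            holds[sender].append(t - arrived.pop(sender))
        arrived[receiver] = t
    return holds


def rotation_times(passes):
    """Time between successive token arrivals at each node (the propagated symptom)."""
    last, rot = {}, defaultdict(list)
    for _, receiver, t in passes:
        if receiver in last:
            rot[receiver].append(t - last[receiver])
        last[receiver] = t
    return rot


def diagnose(ring, passes, factor=3.0):
    """Return the suspected faulty node, or None if no node stands out.

    Assumption: a node is blamed only if its median hold time exceeds `factor`
    times the typical (median) per-node hold time.
    """
    holds = hold_times(ring, passes)
    med = {n: median(holds[n]) for n in ring if holds[n]}
    typical = median(med.values())
    suspect = max(med, key=med.get)
    return suspect if med[suspect] > factor * typical else None


def simulate(ring, rounds, hold=1.0, slow=None, slow_hold=10.0):
    """Generate a synthetic trace; `slow` names a node with an injected delay."""
    t, passes = 0.0, []
    for _ in range(rounds):
        for i, node in enumerate(ring):
            t += slow_hold if node == slow else hold
            passes.append((node, ring[(i + 1) % len(ring)], t))
    return passes


if __name__ == "__main__":
    ring = ["A", "B", "C", "D"]
    faulty = simulate(ring, rounds=5, slow="C")
    # Every node observes the same inflated rotation time (1 + 1 + 10 + 1).
    print(sorted({d for ds in rotation_times(faulty).values() for d in ds}))  # [13.0]
    # Hold-time decomposition points back to the origin.
    print(diagnose(ring, faulty))  # C
    print(diagnose(ring, simulate(ring, rounds=5)))  # None
```

The `factor` threshold means that small slowdowns go undetected. This loosely mirrors the abstract's observation that identification succeeds for failures that significantly change the network's performance characteristics.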