Invariants Based Failure Diagnosis in Distributed Computing Systems

  • Authors:
  • Haifeng Chen; Guofei Jiang; Kenji Yoshihira; Akhilesh Saxena

  • Venue:
  • SRDS '10 Proceedings of the 2010 29th IEEE Symposium on Reliable Distributed Systems
  • Year:
  • 2010

Abstract

This paper presents an instance-based approach to diagnosing failures in computing systems. Since a large portion of the failures that occur are repeats of earlier ones, our method takes advantage of past experience by storing historical failures in a database and retrieving similar instances when a new failure occurs. We extract system ‘invariants’ by modeling consistent dependencies between system attributes during operation, and construct a network graph from the learned invariants. When a failure happens, the status of the invariant network, i.e., whether each invariant link is broken or not, provides a view of the failure's characteristics. We store this failure evidence as a high-dimensional binary vector, and develop a novel algorithm to efficiently retrieve matching failure signatures from the database. Experimental results on a web-based system demonstrate the effectiveness of our method in diagnosing injected failures.
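The abstract does not say how each invariant is modeled or which retrieval algorithm is used, so the following is only a minimal sketch of the overall pipeline under stated assumptions. It assumes that each invariant is a simple linear relation `y ≈ slope * x + intercept` between two metrics that counts as broken when the absolute residual exceeds a fixed threshold. The broken/unbroken status of every invariant link forms the binary signature. Stored signatures are ranked by Jaccard similarity using a plain linear scan, which is a stand-in and not the paper's "novel algorithm". All class, function, metric, and failure names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Invariant:
    """Hypothetical invariant link: y ~= slope * x + intercept (assumed linear model)."""
    x: str
    y: str
    slope: float
    intercept: float
    threshold: float  # assumed maximum tolerated absolute residual

    def is_broken(self, metrics: Dict[str, float]) -> bool:
        # The link is broken when the observed y deviates too far from the prediction.
        predicted = self.slope * metrics[self.x] + self.intercept
        return abs(metrics[self.y] - predicted) > self.threshold


def signature(invariants: List[Invariant], metrics: Dict[str, float]) -> List[int]:
    """Binary failure vector: position i is 1 if the i-th invariant link is broken."""
    return [1 if inv.is_broken(metrics) else 0 for inv in invariants]


def jaccard(a: List[int], b: List[int]) -> float:
    """Similarity of two binary vectors, computed over their broken (1) positions."""
    both = sum(1 for u, v in zip(a, b) if u and v)
    either = sum(1 for u, v in zip(a, b) if u or v)
    # Two signatures with no broken links are treated as identical.
    return both / either if either else 1.0


def retrieve(db: List[Tuple[str, List[int]]], query: List[int], k: int = 1):
    """Linear scan: return the k stored failures most similar to the query signature."""
    ranked = sorted(db, key=lambda item: jaccard(item[1], query), reverse=True)
    return [(label, jaccard(sig, query)) for label, sig in ranked[:k]]


if __name__ == "__main__":
    invariants = [
        Invariant("web_requests", "app_cpu", 0.5, 2.0, 5.0),
        Invariant("app_cpu", "db_queries", 2.0, 0.0, 10.0),
        Invariant("web_requests", "db_queries", 1.0, 4.0, 10.0),
    ]
    observed = {"web_requests": 100.0, "app_cpu": 52.0, "db_queries": 40.0}
    query = signature(invariants, observed)  # [0, 1, 1]: both db_queries links broken
    history = [
        ("db_lock_contention", [0, 1, 1]),
        ("app_memory_leak", [1, 1, 0]),
        ("network_partition", [1, 0, 1]),
    ]
    print(retrieve(history, query))  # [('db_lock_contention', 1.0)]
```

Jaccard similarity ignores positions where both vectors are 0. With many invariants and only a few broken links, this keeps the large number of shared unbroken links from dominating the score, as they would under plain Hamming distance. A linear scan is fine for a small history, but a large failure database would need an indexed search, which is the kind of efficiency problem the abstract says the paper addresses.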