Black-box problem diagnosis in parallel file systems

  • Authors:
  • Michael P. Kasick;Jiaqi Tan;Rajeev Gandhi;Priya Narasimhan

  • Affiliations:
  • Electrical & Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA;DSO National Labs, Singapore, Singapore;Electrical & Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA;Electrical & Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA

  • Venue:
  • FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
  • Year:
  • 2010

Quantified Score

Hi-index 0.00

Visualization

Abstract

We focus on automatically diagnosing different performance problems in parallel file systems by identifying, gathering and analyzing OS-level, black-box performance metrics on every node in the cluster. Our peer-comparison diagnosis approach compares the statistical attributes of these metrics across I/O servers, to identify the faulty node. We develop a root-cause analysis procedure that further analyzes the affected metrics to pinpoint the faulty resource (storage or network), and demonstrate that this approach works commonly across stripe-based parallel file systems. We demonstrate our approach for realistic storage and network problems injected into three different file-system benchmarks (dd, IOzone, and Post-Mark), in both PVFS and Lustre clusters.