Making problem diagnosis work for large-scale, production storage systems

  • Authors:
  • Michael P. Kasick;Priya Narasimhan;Kevin Harms

  • Affiliations:
  • Electrical & Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA;Electrical & Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA;Argonne Leadership Computing Facility, Argonne National Laboratory, Argonne, IL

  • Venue:
  • LISA'13 Proceedings of the 27th international conference on Large Installation System Administration
  • Year:
  • 2013

Abstract

Intrepid has a very large production GPFS storage system consisting of 128 file servers, 32 storage controllers, 1,152 disk arrays, and 11,520 total disks. In such a large system, performance problems are both inevitable and difficult to troubleshoot. We present our experiences in taking an automated problem-diagnosis approach from proof of concept on a 12-server test-bench parallel-file-system cluster and making it work on Intrepid's storage system. We also present a 15-month case study of problems observed in the analysis of 624 GB of Intrepid's instrumentation data, in which we diagnose a variety of performance-related storage-system problems in a matter of hours, compared with the days or longer that manual approaches require.