Making problem diagnosis work for large-scale, production storage systems

Authors:
Michael P. Kasick; Priya Narasimhan; Kevin Harms
Affiliations:
Electrical & Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA; Electrical & Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA; Argonne Leadership Computing Facility, Argonne National Laboratory, Argonne, IL
Venue:
LISA'13 Proceedings of the 27th international conference on Large Installation System Administration
Year:
2013

Abstract

Intrepid has a very large production GPFS storage system consisting of 128 file servers, 32 storage controllers, 1,152 disk arrays, and 11,520 total disks. In such a large system, performance problems are both inevitable and difficult to troubleshoot. We present our experience of taking an automated problem-diagnosis approach from proof of concept on a 12-server test-bench parallel-file-system cluster and making it work on Intrepid's storage system. We also present a 15-month case study of problems observed from the analysis of 624 GB of Intrepid's instrumentation data, in which we diagnose a variety of performance-related storage-system problems in a matter of hours, compared with the days or longer that manual approaches require.