Understanding storage system problems and diagnosing them through log analysis

Authors:
Yuanyuan Zhou;Weihang Jiang
Affiliations:
University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign
Venue:
Understanding storage system problems and diagnosing them through log analysis
Year:
2009

Citing 0
Cited 1

SherLog: error diagnosis by connecting clues from run-time logs

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

Nowadays, over 90% new information produced are stored on hard disk drives. The explosion of data is making storage system a strategic investment priority in the enterprise world. The revenue created by storage system industry steadily increases from $14.2 Billion in 2004 to over $18.4 Billion in 2007. As a key component of enterprise systems, reliable storage systems are critical. However, despite the efforts put into building robust storage systems, as the size and complexity of storage systems have grown to an unprecedented level, storage system problems are common. Unfortunately, many aspects of storage system problems are still not well understood, and most of previous studies only focus on one component - disk drives. To better understand storage system problems, we analyzed the failure characteristics of the core part of storage system - the storage subsystem, which contains disks and all components providing connectivity and usage of disk to the entire storage system. More specifically, we analyzed the storage system logs collected from about 39,000 storage systems commercially deployed at various customer sites. The data set covers a period of 44 months and includes about 1,800,000 disks hosted in about 155,000 storage shelf enclosures. Our study reveals many interesting findings, providing useful guideline for designing reliable storage systems. Some of the major findings include: (1) In addition to disk failures that contribute to 20–55% of storage subsystem failures, other components such as physical interconnects and protocol stacks also account for significant percentages of storage subsystem failures. (2) Each individual storage subsystem failure type and storage subsystem failure as a whole exhibit strong self-correlations. In addition, these failures exhibit bursty patterns. (3) Storage subsystems configured with dual-path interconnects experience 30–40% lower failure rates than those with a single interconnect. (4) Spanning disks of a RAID group across multiple shelves provides a more resilient solution for storage subsystems than within a single shelf. As we found out that storage subsystem problems are far beyond disk failures, we extend the scope of study to various storage system problems, and study the characteristics of storage system problem troubleshooting from various dimensions. Using a large set (636,108) of real world customer problem cases reported from 100,000 commercially deployed storage systems in the last two years, the analysis show that while some problems are either benign, or resolved automatically, many others can take hours or days of manual diagnosis to fix. For modern storage systems, hardware failures and misconfigurations dominate customer cases, but software failures take longer time to resolve. Interestingly, a relatively significant percentage of cases are because customers lack sufficient knowledge about the system. We also evaluate the potential of using storage system logs to resolve these problems. Our analysis shows that a failure message alone is a poor indicator of root cause, and that combining failure messages with multiple log events can improve problem root cause prediction by a factor of three. One key finding is that storage system logs contain useful information for narrowing down the root cause, while they are challenging to analyze manually because they are noisy and the useful log events are often separated by hundreds of irrelevant log events. Motivated by this finding, we designed and implemented an automatic tool, called Log Analyzer, to improve problem troubleshooting process. By applying statistical analysis techniques, the Log Analyzer can automatically infer the dependency relationship between log events, and identify the key log events that capture the essential system states related to storage system problems. By combining classic unsupervised classification techniques - hierarchical clustering with the event ranking techniques, the Log Analyzer can also identify recurrent storage system problems based on similar log patterns, so that previous diagnosis efforts can be systematically retrieved and leveraged. We train the Log Analyze with 18,878 week-long storage system logs and evaluate it with 164 real-world problem cases. The evaluation indicates that the Log Analyzer can effectively reduce the log event number to 3.4%. For most of the 16 real-world problem cases manually annotated with 1–3 key log events, the Log Analyzer accurately ranked the key log events within top 3 without a priori knowledge on how important the events are. For the other 148 problem cases with diagnosis and with root cause information, the Log Analyzer effectively grouped problem cases with the same root cause together with 63–93% accuracy, significantly outperforming other three alternative solutions which only achieve 30–46% accuracy.