An Analysis of Error Behaviour in a Large Storage System

Authors:
Nisha Talagala; David Patterson
Venue:
An Analysis of Error Behaviour in a Large Storage System
Year:
1999

Abstract

This paper analyzes the error behavior of a 3.2 TB disk storage system. We report reliability data for 18 months of the prototype's operation and analyze 6 months of error logs from nodes in the prototype. We found that the disk drives were among the most reliable components in the system. We were also able to divide the errors into eleven categories, comprising disk, network, and SCSI errors, that appeared repeatedly across all nodes. In addition, we gained insight into the types of error messages that devices report under various conditions and the effects of these events on the operating system. We also present data from four cases of disk drive failure. These results and insights should be useful to any designer of a fault-tolerant storage system.
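Grouping raw error-log lines into categories and checking which categories recur on every node can be sketched roughly as follows. This is a minimal illustration, not the authors' method: the abstract does not list the eleven categories, so the three coarse families (disk, network, SCSI), the regular expressions in the hypothetical `CATEGORY_PATTERNS` table, and the assumed `node: message` line format are all illustrative assumptions.

```python
import re
from collections import Counter

# Hypothetical keyword patterns (assumption): the real eleven categories are
# not given in the abstract, so only three coarse families are sketched here.
# Order matters: the first matching category wins, so SCSI is checked first.
CATEGORY_PATTERNS = {
    "scsi": re.compile(r"\bscsi\d*\b|\bsense key\b|\bbus reset\b", re.IGNORECASE),
    "disk": re.compile(r"\bmedium error\b|\bbad block\b", re.IGNORECASE),
    "network": re.compile(r"\beth\d+\b|\blink (?:down|up)\b", re.IGNORECASE),
}


def classify(message: str) -> str:
    """Return the first category whose pattern matches the message, else 'other'."""
    for category, pattern in CATEGORY_PATTERNS.items():
        if pattern.search(message):
            return category
    return "other"


def tally(lines):
    """Count lines per category and record which nodes reported each category.

    Assumes (hypothetically) that each log line looks like 'node: message'.
    """
    per_category = Counter()
    nodes_per_category = {}
    for line in lines:
        node, _, message = line.partition(":")
        category = classify(message)
        per_category[category] += 1
        nodes_per_category.setdefault(category, set()).add(node.strip())
    return per_category, nodes_per_category


def recurring_everywhere(nodes_per_category, all_nodes):
    """Categories that appear on every node in all_nodes."""
    return {c for c, nodes in nodes_per_category.items() if nodes >= set(all_nodes)}
```

For example, feeding `tally` lines such as `"node1: scsi0: bus reset"` and `"node2: eth0: link down"` counts one SCSI and one network event, and `recurring_everywhere` then reports only the categories seen on all listed nodes. A real analysis would replace the keyword table with patterns derived from the actual driver and kernel messages in the logs.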