An analysis of latent sector errors in disk drives

Authors:
Lakshmi N. Bairavasundaram;Garth R. Goodson;Shankar Pasupathy;Jiri Schindler
Affiliations:
University of Wisconsin-Madison;Network Appliance, Inc.;Network Appliance, Inc.;Network Appliance, Inc.
Venue:
Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Year:
2007

Abstract

The reliability measures in today's disk drive-based storage systems focus predominantly on protecting against complete disk failures. Previous disk reliability studies have analyzed empirical data in an attempt to better understand and predict disk failure rates. Yet very little is known about the incidence of latent sector errors, i.e., errors that go undetected until the corresponding disk sectors are accessed. Our study analyzes data collected from production storage systems over 32 months across 1.53 million disks (both nearline and enterprise class). We analyze factors that impact latent sector errors, observe trends, and explore their implications for the design of reliability mechanisms in storage systems. To the best of our knowledge, this is the first study of such large scale (our sample size is at least an order of magnitude larger than in previously published studies) and the first one to focus specifically on latent sector errors and their implications for the design and reliability of storage systems.