Enhanced Reliability Modeling of RAID Storage Systems

Authors:
Jon G. Elerath;Michael Pecht
Affiliations:
Network Appliance, Inc.;University of Maryland, USA
Venue:
DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
Year:
2007

Citing 0
Cited 19

Hard Disk Drives: The Good, the Bad and the Ugly!
Queue - File Systems and Storage

Parity lost and parity regained
FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies

Disk scrubbing versus intra-disk redundancy for high-reliability raid storage systems
SIGMETRICS '08 Proceedings of the 2008 ACM SIGMETRICS international conference on Measurement and modeling of computer systems

Idle read after write: IRAW
ATC'08 USENIX 2008 Annual Technical Conference on Annual Technical Conference

Free factories: unified infrastructure for data intensive web services
ATC'08 USENIX 2008 Annual Technical Conference on Annual Technical Conference

Hard-disk drives: the good, the bad, and the ugly
Communications of the ACM - One Laptop Per Child: Vision vs. Reality

Higher reliability redundant disk arrays: Organization, operation, and coding
ACM Transactions on Storage (TOS)

Understanding latent sector errors and how to protect against them
ACM Transactions on Storage (TOS)

Keeping bits safe: how hard can it be?
Communications of the ACM

A clean-slate look at disk scrubbing
FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies

Understanding latent sector errors and how to protect against them
FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies

A spin-up saved is energy earned: achieving power-efficient, erasure-coded storage
HotDep'08 Proceedings of the Fourth conference on Hot topics in system dependability

Mean time to meaningless: MTTDL, Markov models, and storage system reliability
HotStorage'10 Proceedings of the 2nd USENIX conference on Hot topics in storage and file systems

Keeping Bits Safe: How Hard Can It Be?
Queue - Storage

Minimum density RAID-6 codes
ACM Transactions on Storage (TOS)

Online availability upgrades for parity-based RAIDs through supplementary parity augmentations
ACM Transactions on Storage (TOS)

Disk Scrubbing Versus Intradisk Redundancy for RAID Storage Systems
ACM Transactions on Storage (TOS)

HPDA: A hybrid parity-based disk array for enhanced performance and reliability
ACM Transactions on Storage (TOS)

Understanding data survivability in archival storage systems
Proceedings of the 5th Annual International Systems and Storage Conference

Abstract

A flexible model for estimating the reliability of RAID storage systems is presented. The model corrects errors that arise from the common assumption that system times to failure follow a homogeneous Poisson process. Separate generalized failure distributions model catastrophic failures and usage-dependent data corruptions for each hard drive. Catastrophic failure restoration is represented by a three-parameter Weibull distribution, so the model can include a minimum time to restore as a function of data transfer rate and hard drive storage capacity. Data can be scrubbed as a background operation to eliminate corrupted data that, in the event of a simultaneous catastrophic failure, would result in a double disk failure. Field-based time-to-failure data and a mathematical justification for the new model are presented. Model results have been verified and predict between 2 and 1,500 times as many double disk failures as estimated by the current mean time to data loss (MTTDL) method.
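
As a rough illustration of the restoration model described in the abstract, the sketch below draws restore times from a three-parameter Weibull distribution whose location parameter is the minimum time needed to rewrite a full drive. The function names, the 500 GB capacity, the 50 MB/s transfer rate, and the scale and shape values are all hypothetical choices made for this example; none of them are taken from the paper, and the paper's own simulation may differ.

```python
import math
import random


def min_restore_hours(capacity_gb: float, rate_mb_per_s: float) -> float:
    """Lower bound on restore time: hours to transfer the whole drive once.

    Assumes decimal units (1 GB = 1000 MB) and a constant sustained rate.
    """
    seconds = capacity_gb * 1000.0 / rate_mb_per_s
    return seconds / 3600.0


def sample_restore_hours(location: float, scale: float, shape: float,
                         rng: random.Random) -> float:
    """Draw one restore time from a three-parameter Weibull distribution.

    Uses inverse-CDF sampling: t = location + scale * (-ln(1 - U))**(1 / shape),
    so no sample can fall below `location`.
    """
    u = rng.random()  # uniform on [0, 1), so 1 - u is in (0, 1]
    return location + scale * (-math.log(1.0 - u)) ** (1.0 / shape)


# Hypothetical drive and distribution parameters, for illustration only.
LOCATION_H = min_restore_hours(capacity_gb=500.0, rate_mb_per_s=50.0)
SCALE_H = 12.0   # assumed characteristic restore time beyond the minimum
SHAPE = 2.0      # assumed shape parameter

rng = random.Random(0)
samples = [sample_restore_hours(LOCATION_H, SCALE_H, SHAPE, rng)
           for _ in range(1000)]
```

With these assumed values the minimum restore time is 10,000 seconds, a little under three hours, and every sampled restore time is at least that long. A constant-repair-rate model, by contrast, allows arbitrarily short restores.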