Disk Scrubbing Versus Intradisk Redundancy for RAID Storage Systems

Authors:
Ilias Iliadis;Robert Haas;Xiao-Yu Hu;Evangelos Eleftheriou
Affiliations:
IBM Zurich Research Laboratory;IBM Zurich Research Laboratory;IBM Zurich Research Laboratory;IBM Zurich Research Laboratory
Venue:
ACM Transactions on Storage (TOS)
Year:
2011

Citing 24
Cited 5

A case for redundant arrays of inexpensive disks (RAID)

SIGMOD '88 Proceedings of the 1988 ACM SIGMOD international conference on Management of data
An introduction to disk drive modeling

Computer
RAID: high-performance, reliable secondary storage

ACM Computing Surveys (CSUR)
EVENODD: An Efficient Scheme for Tolerating Double Disk Failures in RAID Architectures

IEEE Transactions on Computers - Special issue on fault-tolerant computing
Probability and statistics with reliability, queuing and computer science applications

Probability and statistics with reliability, queuing and computer science applications
Disk Scrubbing in Large Archival Storage Systems

MASCOTS '04 Proceedings of the The IEEE Computer Society's 12th Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems
Analysis of a new intra-disk redundancy scheme for high-reliability RAID storage systems in the presence of unrecoverable errors

SIGMETRICS '06/Performance '06 Proceedings of the joint international conference on Measurement and modeling of computer systems
Using device diversity to protect data against batch-correlated disk failures

Proceedings of the second ACM workshop on Storage security and survivability
A fresh look at the reliability of long-term digital storage

Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
Enhanced Reliability Modeling of RAID Storage Systems

DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
An analysis of latent sector errors in disk drives

Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Disk drive level workload characterization

ATEC '06 Proceedings of the annual conference on USENIX '06 Annual Technical Conference
Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?

FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Failure trends in a large disk drive population

FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
A new intra-disk redundancy scheme for high-reliability RAID storage systems in the presence of unrecoverable errors

ACM Transactions on Storage (TOS)
Disk scrubbing versus intra-disk redundancy for high-reliability raid storage systems

SIGMETRICS '08 Proceedings of the 2008 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Idle read after write: IRAW

ATC'08 USENIX 2008 Annual Technical Conference on Annual Technical Conference
Reliability Assurance of RAID Storage Systems for a Wide Range of Latent Sector Errors

NAS '08 Proceedings of the 2008 International Conference on Networking, Architecture, and Storage
Higher reliability redundant disk arrays: Organization, operation, and coding

ACM Transactions on Storage (TOS)
A clean-slate look at disk scrubbing

FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
Understanding latent sector errors and how to protect against them

FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
On the impact of disk scrubbing on energy savings

HotPower'08 Proceedings of the 2008 conference on Power aware computing and systems
Mean time to meaningless: MTTDL, Markov models, and storage system reliability

HotStorage'10 Proceedings of the 2nd USENIX conference on Hot topics in storage and file systems
Row-diagonal parity for double disk failure correction

FAST'04 Proceedings of the 3rd USENIX conference on File and storage technologies

Survey and analysis of disk scheduling methods

ACM SIGARCH Computer Architecture News
Rebuild processing in RAID5 with emphasis on the supplementary parity augmentation method[37]

ACM SIGARCH Computer Architecture News
Performance, reliability, and performability of a hybrid RAID array and a comparison with traditional RAID1 arrays

Cluster Computing
Hierarchical RAID: Design, performance, reliability, and recovery

Journal of Parallel and Distributed Computing
STAIR codes: a general family of erasure codes for tolerating device and sector failures in practical storage systems

FAST'14 Proceedings of the 12th USENIX conference on File and Storage Technologies

Quantified Score

Hi-index	0.00

Visualization

Abstract

Two schemes proposed to cope with unrecoverable or latent media errors and enhance the reliability of RAID systems are examined. The first scheme is the established, widely used, disk scrubbing scheme, which operates by periodically accessing disk drives to detect media-related unrecoverable errors. These errors are subsequently corrected by rebuilding the sectors affected. The second scheme is the recently proposed intradisk redundancy scheme, which uses a further level of redundancy inside each disk, in addition to the RAID redundancy across multiple disks. A new model is developed to evaluate the extent to which disk scrubbing reduces the unrecoverable sector errors. The probability of encountering unrecoverable sector errors is derived analytically under very general conditions regarding the characteristics of the read/write process of uniformly distributed random workloads and for a broad spectrum of disk scrubbing schemes, which includes the deterministic and random scrubbing schemes. We show that the deterministic scrubbing scheme is the most efficient one. We also derive closed-form expressions for the percentage of unrecoverable sector errors that the scrubbing scheme detects and corrects, the throughput performance, and the minimum scrubbing period achievable under operation with random, uniformly distributed I/O requests. Our results demonstrate that the reliability improvement due to disk scrubbing depends on the scrubbing frequency and the load of the system, and, for heavy-write workloads, may not reach the reliability level achieved by a simple interleaved parity-check (IPC)-based intradisk redundancy scheme, which is insensitive to the load. In fact, for small unrecoverable sector error probabilities, the IPC-based intradisk redundancy scheme achieves essentially the same reliability as that of a system operating without unrecoverable sector errors. For heavy loads, the reliability achieved by the scrubbing scheme can be orders of magnitude less than that of the intradisk redundancy scheme. Finally, the I/O and throughput performances are evaluated by means of analysis and event-driven simulation.