Disk Scrubbing in Large Archival Storage Systems

Authors:
Thomas J. E. Schwarz;Qin Xin;Ethan L. Miller;Darrell D. E. Long;Andy Hospodor;Spencer Ng
Affiliations:
Santa Clara University and University of California at Santa Cruz;University of California at Santa Cruz and Hitachi Global Storage Technologies;University of California at Santa Cruz;University of California at Santa Cruz;Santa Clara University;Hitachi Global Storage Technologies
Venue:
MASCOTS '04 Proceedings of the IEEE Computer Society's 12th Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems
Year:
2004

Abstract

Large archival storage systems experience long periods of idleness broken up by rare data accesses. In such systems, disks may remain powered off for long periods of time. These systems can lose data for a variety of reasons, including failures at both the device level and the block level. To deal with these failures, we must detect them early enough to be able to use the redundancy built into the storage system. We propose a process called "disk scrubbing" in a system in which drives are periodically accessed to detect drive failure. By scrubbing all of the data stored on all of the disks, we can detect block failures and compensate for them by rebuilding the affected blocks. Our research shows how the scheduling of disk scrubbing affects overall system reliability, and that "opportunistic" scrubbing, in which the system scrubs disks only when they are powered on for other reasons, performs very well without the need to power on disks solely to check them.
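
The difference between scheduled and "opportunistic" scrubbing described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `Disk` record, the `should_scrub` function, the policy names, and the interval value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    """Hypothetical bookkeeping record for one archival disk."""
    name: str
    powered_on: bool
    hours_since_scrub: float


def should_scrub(disk: Disk, policy: str, interval_hours: float) -> bool:
    """Decide whether to scrub a disk now (illustrative policies only).

    "periodic":      scrub once the interval has elapsed, even if the
                     disk must be powered on solely for the scrub.
    "opportunistic": scrub only if the disk is already powered on for
                     some other reason and the interval has elapsed.
    """
    due = disk.hours_since_scrub >= interval_hours
    if policy == "periodic":
        return due
    if policy == "opportunistic":
        return disk.powered_on and due
    raise ValueError(f"unknown policy: {policy!r}")


if __name__ == "__main__":
    # Assumed example values: a 30-day (720-hour) scrub interval.
    idle_disk = Disk("disk-a", powered_on=False, hours_since_scrub=1000.0)
    busy_disk = Disk("disk-b", powered_on=True, hours_since_scrub=1000.0)
    for disk in (idle_disk, busy_disk):
        for policy in ("periodic", "opportunistic"):
            print(disk.name, policy, should_scrub(disk, policy, 720.0))
```

In this sketch the opportunistic policy never causes a power-on by itself, so a disk that stays powered off is left unscrubbed until it is next accessed; how that trade-off affects reliability is the question the paper studies.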