The reliability measures in today's disk drive-based storage systems focus predominantly on protecting against complete disk failures. Previous disk reliability studies have analyzed empirical data in an attempt to better understand and predict disk failure rates. Yet very little is known about the incidence of latent sector errors, i.e., errors that go undetected until the corresponding disk sectors are accessed. Our study analyzes data collected from production storage systems over 32 months across 1.53 million disks (both nearline and enterprise class). We analyze factors that impact latent sector errors, observe trends, and explore their implications for the design of reliability mechanisms in storage systems. To the best of our knowledge, this is the first study of such large scale (our sample size is at least an order of magnitude larger than those of previously published studies) and the first to focus specifically on latent sector errors and their implications for the design and reliability of storage systems.