A case for redundant arrays of inexpensive disks (RAID)
SIGMOD '88 Proceedings of the 1988 ACM SIGMOD international conference on Management of data
EVENODD: an optimal scheme for tolerating double disk failures in RAID architectures
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Tolerating multiple failures in RAID architectures with optimal storage and uniform declustering
Proceedings of the 24th annual international symposium on Computer architecture
Efficient Placement of Parity and Data to Tolerate Two Disk Failures in Disk Array Systems
IEEE Transactions on Parallel and Distributed Systems
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Commercial Fault Tolerance: A Tale of Two Systems
IEEE Transactions on Dependable and Secure Computing
Proceedings of the twentieth ACM symposium on Operating systems principles
Awarded Best Paper! -- Row-Diagonal Parity for Double Disk Failure Correction
FAST '04 Proceedings of the 3rd USENIX Conference on File and Storage Technologies
Awarded Best Student Paper! -- Improving Storage System Availability with D-GRAID
FAST '04 Proceedings of the 3rd USENIX Conference on File and Storage Technologies
Ensuring data integrity in storage: techniques and applications
Proceedings of the 2005 ACM workshop on Storage security and survivability
Matrix methods for lost data reconstruction in erasure codes
FAST'05 Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies - Volume 4
WEAVER codes: highly fault tolerant erasure codes for storage systems
FAST'05 Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies - Volume 4
An analysis of latent sector errors in disk drives
Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?
FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Failure trends in a large disk drive population
FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
Parity lost and parity regained
FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
ATC'08 USENIX 2008 Annual Technical Conference on Annual Technical Conference
FlexVol: flexible, efficient file volume virtualization in WAFL
ATC'08 USENIX 2008 Annual Technical Conference on Annual Technical Conference
ACM Transactions on Storage (TOS)
Undetected disk errors in RAID arrays
IBM Journal of Research and Development
Understanding customer problem troubleshooting from storage system logs
FAST '09 Proccedings of the 7th conference on File and storage technologies
Higher reliability redundant disk arrays: Organization, operation, and coding
ACM Transactions on Storage (TOS)
Uncovering errors: the cost of detecting silent data corruption
Proceedings of the 4th Annual Workshop on Petascale Data Storage
Extract and infer quickly: Obtaining sector geometry of modern hard disk drives
ACM Transactions on Storage (TOS)
DARC: design and evaluation of an I/O controller for data protection
Proceedings of the 3rd Annual Haifa Experimental Systems Conference
Keeping bits safe: how hard can it be?
Communications of the ACM
End-to-end data integrity for file systems: a ZFS case study
FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
SQCK: a declarative file system checker
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Tolerating file-system mistakes with EnvyFS
USENIX'09 Proceedings of the 2009 conference on USENIX Annual technical conference
Reliable data-center scale computations
Proceedings of the 4th International Workshop on Large Scale Distributed Systems and Middleware
HotStorage'10 Proceedings of the 2nd USENIX conference on Hot topics in storage and file systems
Mean time to meaningless: MTTDL, Markov models, and storage system reliability
HotStorage'10 Proceedings of the 2nd USENIX conference on Hot topics in storage and file systems
Keeping Bits Safe: How Hard Can It Be?
Queue - Storage
Availability in globally distributed storage systems
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Reining in the outliers in map-reduce clusters using Mantri
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
High performance multi-node file copies and checksums for clustered file systems
LISA'10 Proceedings of the 24th international conference on Large installation system administration
Using Paxos to build a scalable, consistent, and highly available datastore
Proceedings of the VLDB Endowment
Cycles, cells and platters: an empirical analysisof hardware failures on a million consumer PCs
Proceedings of the sixth conference on Computer systems
ACM Transactions on Storage (TOS)
Paxos replicated state machines as the basis of a high-performance data store
Proceedings of the 8th USENIX conference on Networked systems design and implementation
Warding off the dangers of data corruption with amulet
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Towards reliable storage systems
Towards reliable storage systems
Fast black-box testing of system recovery code
Proceedings of the 7th ACM european conference on Computer Systems
Analysis of Workload Behavior in Scientific and Historical Long-Term Data Repositories
ACM Transactions on Storage (TOS)
Definition, detection, and recovery of single-page failures, a fourth class of database failures
Proceedings of the VLDB Endowment
Fault resilience of the algebraic multi-grid solver
Proceedings of the 26th ACM international conference on Supercomputing
Erasure coding in windows azure storage
USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
Interactive analytical processing in big data systems: a cross-industry study of MapReduce workloads
Proceedings of the VLDB Endowment
Understanding data survivability in archival storage systems
Proceedings of the 5th Annual International Systems and Storage Conference
A taxonomy of biometric system vulnerabilities and defences
International Journal of Biometrics
*-Box: towards reliability and consistency in dropbox-like file synchronization services
HotStorage'13 Proceedings of the 5th USENIX conference on Hot Topics in Storage and File Systems
Ffsck: The Fast File-System Checker
ACM Transactions on Storage (TOS)
A Study of Linux File System Evolution
ACM Transactions on Storage (TOS)
Sector-Disk (SD) Erasure Codes for Mixed Failure Modes in RAID Systems
ACM Transactions on Storage (TOS)
Beyond MTTDL: A Closed-Form RAID 6 Reliability Equation
ACM Transactions on Storage (TOS)
Ffsck: the fast file system checker
FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
A study of Linux file system evolution
FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
SD codes: erasure codes designed for how storage systems really fail
FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
HARDFS: hardening HDFS with selective and lightweight versioning
FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
ViewBox: integrating local file systems with cloud storage services
FAST'14 Proceedings of the 12th USENIX conference on File and Storage Technologies
Hi-index | 0.02 |
An important threat to reliable storage of data is silent data corruption. In order to develop suitable protection mechanisms against data corruption, it is essential to understand its characteristics. In this paper, we present the first large-scale study of data corruption. We analyze corruption instances recorded in production storage systems containing a total of 1.53 million disk drives, over a period of 41 months. We study three classes of corruption: checksum mismatches, identity discrepancies, and parity inconsistencies. We focus on checksum mismatches since they occur the most. We find more than 400,000 instances of checksum mismatches over the 41-month period. We find many interesting trends among these instances including: (i) nearline disks (and their adapters) develop checksum mismatches an order of magnitude more often than enterprise class disk drives, (ii) checksum mismatches within the same disk are not independent events and they show high spatial and temporal locality, and (iii) checksum mismatches across different disks in the same storage system are not independent. We use our observations to derive lessons for corruption-proof system design.