We harden the Hadoop Distributed File System (HDFS) against fail-silent (non-fail-stop) behaviors that result from memory corruption and software bugs, using a new approach: selective and lightweight versioning (SLEEVE). With this approach, actions performed by important subsystems of HDFS (e.g., namespace management) are checked by a second implementation of the subsystem that uses lightweight, approximate data structures. We show that the resulting system, HARDFS, detects and recovers from a wide range of fail-silent behaviors caused by random bit flips, targeted corruptions, and real software bugs. In particular, HARDFS handles 90% of the fail-silent faults that result from random memory corruption, and it correctly detects and recovers from 100% of 78 targeted corruptions and 5 real-world bugs. Moreover, by using micro-recovery, it recovers orders of magnitude faster than a full reboot. The extra protection in HARDFS incurs minimal performance and space overheads.
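The idea of checking a subsystem against a second, lightweight version can be illustrated with a small sketch. This is not the HARDFS implementation: the class names (`BloomFilter`, `CheckedNamespace`), the path-to-owner namespace model, and the choice of a Bloom filter as the approximate data structure are all assumptions made for illustration. The sketch keeps a primary dictionary as the main version and a compact filter as the second version, and cross-checks the two on every lookup.

```python
import hashlib


class BloomFilter:
    """Approximate set: no false negatives, rare false positives."""

    def __init__(self, num_bits=8192, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


class CheckedNamespace:
    """Hypothetical namespace whose answers are checked by a second version."""

    def __init__(self):
        self.main = {}               # main version: path -> owner
        self.shadow = BloomFilter()  # second version: encoded (path, owner) facts

    def create(self, path, owner):
        # Both versions record the update independently.
        self.main[path] = owner
        self.shadow.add(f"{path}={owner}")

    def lookup(self, path):
        owner = self.main[path]
        # Cross-check the main version's answer before returning it.
        if f"{path}={owner}" not in self.shadow:
            raise RuntimeError(f"fail-silent fault detected for {path}")
        return owner
```

In this sketch, a corrupted entry in `main` (for example, an owner field overwritten in memory) is very likely to be absent from `shadow`, so `lookup` raises instead of silently returning a wrong answer. Because the filter can produce false positives, detection is probabilistic rather than guaranteed, which is the trade-off made in exchange for a small memory footprint. Recovery is out of scope here; the sketch covers only detection.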