HARDFS: hardening HDFS with selective and lightweight versioning

Authors:
Thanh Do;Tyler Harter;Yingchao Liu;Haryadi S. Gunawi;Andrea C. Arpaci-Dusseau;Remzi H. Arpaci-Dusseau
Affiliations:
University of Wisconsin, Madison;University of Wisconsin, Madison;University of Wisconsin, Madison;University of Chicago;University of Wisconsin, Madison;University of Wisconsin, Madison
Venue:
FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
Year:
2013

Citing 42
Cited 0

A case for redundant arrays of inexpensive disks (RAID)

SIGMOD '88 Proceedings of the 1988 ACM SIGMOD international conference on Management of data
RAID: high-performance, reliable secondary storage

ACM Computing Surveys (CSUR)
The HP AutoRAID hierarchical storage system

ACM Transactions on Computer Systems (TOCS) - Special issue on operating system principles
Practical Byzantine fault tolerance

OSDI '99 Proceedings of the third symposium on Operating systems design and implementation
Summary cache: a scalable wide-area web cache sharing protocol

IEEE/ACM Transactions on Networking (TON)
The Byzantine Generals Problem

ACM Transactions on Programming Languages and Systems (TOPLAS)
Space/time trade-offs in hash coding with allowable errors

Communications of the ACM
BASE: using abstraction to improve fault tolerance

SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
Transaction Processing: Concepts and Techniques

Transaction Processing: Concepts and Techniques
Disk Shadowing

VLDB '88 Proceedings of the 14th International Conference on Very Large Data Bases
The Google file system

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Improving the reliability of commodity operating systems

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Commercial Fault Tolerance: A Tale of Two Systems

IEEE Transactions on Dependable and Secure Computing
Reliability and security of RAID storage systems and D2D archives using SATA disk drives

ACM Transactions on Storage (TOS)
When do changes induce fixes?

MSR '05 Proceedings of the 2005 international workshop on Mining software repositories
Semantically-Smart Disk Systems

FAST '03 Proceedings of the 2nd USENIX Conference on File and Storage Technologies
Awarded Best Student Paper! -- Improving Storage System Availability with D-GRAID

FAST '04 Proceedings of the 3rd USENIX Conference on File and Storage Technologies
Recovering device drivers

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Microreboot — A technique for cheap recovery

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
EXPLODE: a lightweight, general system for finding serious storage system errors

OSDI '06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7
Pip: detecting the unexpected in distributed systems

NSDI'06 Proceedings of the 3rd conference on Networked Systems Design & Implementation - Volume 3
An improved construction for counting bloom filters

ESA'06 Proceedings of the 14th conference on Annual European Symposium - Volume 14
Paxos made live: an engineering perspective

Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing
Improving file system reliability with I/O shepherding

Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Bigtable: a distributed storage system for structured data

OSDI '06 Proceedings of the 7th symposium on Operating systems design and implementation
On designing and deploying internet-scale services

LISA'07 Proceedings of the 21st conference on Large Installation System Administration Conference
An analysis of data corruption in the storage stack

FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
DRAM errors in the wild: a large-scale field study

Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
Toward a cloud computing research agenda

ACM SIGACT News
MODIST: transparent model checking of unmodified distributed systems

NSDI'09 Proceedings of the 6th USENIX symposium on Networked systems design and implementation
Upright cluster services

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Boom analytics: exploring data-centric, declarative programming for the cloud

Proceedings of the 5th European conference on Computer systems
Has the bug really been fixed?

Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1
Membrane: operating system support for restartable file systems

FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
Tolerating file-system mistakes with EnvyFS

USENIX'09 Proceedings of the 2009 conference on USENIX Annual technical conference
The Hadoop Distributed File System

MSST '10 Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST)
FATE and DESTINI: a framework for cloud recovery testing

Proceedings of the 8th USENIX conference on Networked systems design and implementation
How do fixes become bugs?

Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering
Fast crash recovery in RAMCloud

SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
A file is not a file: understanding the I/O behavior of Apple desktop applications

SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Recon: verifying file system consistency at runtime

FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
Practical hardening of crash-tolerant systems

USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference

Quantified Score

Hi-index	0.00

Visualization

Abstract

We harden the Hadoop Distributed File System (HDFS) against fail-silent (non fail-stop) behaviors that result from memory corruption and software bugs using a new approach: selective and lightweight versioning (SLEEVE). With this approach, actions performed by important subsystems of HDFS (e.g., namespace management) are checked by a second implementation of the subsystem that uses lightweight, approximate data structures. We show that HARDFS detects and recovers from a wide range of fail-silent behaviors caused by random bit flips, targeted corruptions, and real software bugs. In particular, HARDFS handles 90% of the fail-silent faults that result from random memory corruption and correctly detects and recovers from 100% of 78 targeted corruptions and 5 real-world bugs. Moreover, it recovers orders of magnitude faster than full reboot by using micro-recovery. The extra protection in HARDFS incurs minimal performance and space overheads.