HARDFS: hardening HDFS with selective and lightweight versioning

  • Authors:
  • Thanh Do; Tyler Harter; Yingchao Liu; Haryadi S. Gunawi; Andrea C. Arpaci-Dusseau; Remzi H. Arpaci-Dusseau

  • Affiliations:
  • University of Wisconsin, Madison; University of Wisconsin, Madison; University of Wisconsin, Madison; University of Chicago; University of Wisconsin, Madison; University of Wisconsin, Madison

  • Venue:
  • FAST '13: Proceedings of the 11th USENIX Conference on File and Storage Technologies
  • Year:
  • 2013

Abstract

We harden the Hadoop Distributed File System (HDFS) against fail-silent (non-fail-stop) behaviors that result from memory corruption and software bugs, using a new approach: selective and lightweight versioning (SLEEVE). With this approach, actions performed by important subsystems of HDFS (e.g., namespace management) are checked by a second implementation of the subsystem that uses lightweight, approximate data structures. We show that HARDFS detects and recovers from a wide range of fail-silent behaviors caused by random bit flips, targeted corruptions, and real software bugs. In particular, HARDFS handles 90% of the fail-silent faults that result from random memory corruption, and it correctly detects and recovers from 100% of 78 targeted corruptions and 5 real-world bugs. Moreover, by using micro-recovery, it recovers orders of magnitude faster than a full reboot. The extra protection in HARDFS incurs minimal performance and space overheads.
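
The checking idea described in the abstract can be illustrated with a small sketch. It is not the HARDFS implementation: the abstract says only that the second version uses "lightweight, approximate data structures", so the counting Bloom filter used here is an assumption, and the names `CountingBloomFilter`, `CheckedNamespace`, and `StateMismatch` are hypothetical. The main version keeps the full namespace state, the second version keeps a compact encoding of it, and every lookup is cross-checked so that a silent corruption of the main state is detected instead of being returned to the caller.

```python
import hashlib


class StateMismatch(Exception):
    """Raised when the main version disagrees with the lightweight version."""


class CountingBloomFilter:
    """Approximate multiset: false positives are possible, false negatives are not."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _indexes(self, item):
        # Derive num_hashes counter positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for idx in self._indexes(item):
            self.counters[idx] += 1

    def remove(self, item):
        for idx in self._indexes(item):
            if self.counters[idx] > 0:
                self.counters[idx] -= 1

    def might_contain(self, item):
        return all(self.counters[idx] > 0 for idx in self._indexes(item))


class CheckedNamespace:
    """Toy namespace manager whose results are checked by a second version."""

    def __init__(self):
        self.main = {}                       # main version: path -> owner
        self.shadow = CountingBloomFilter()  # second version: encoded facts

    @staticmethod
    def _fact(path, owner):
        return f"{path}|{owner}"

    def create(self, path, owner):
        self.main[path] = owner
        self.shadow.add(self._fact(path, owner))

    def lookup(self, path):
        owner = self.main[path]
        # Cross-check the main version's answer before returning it.
        if not self.shadow.might_contain(self._fact(path, owner)):
            raise StateMismatch(f"main state for {path!r} failed the check")
        return owner

    def delete(self, path):
        owner = self.lookup(path)  # verify before mutating either version
        del self.main[path]
        self.shadow.remove(self._fact(path, owner))


if __name__ == "__main__":
    ns = CheckedNamespace()
    ns.create("/user/a", "alice")
    print(ns.lookup("/user/a"))
    ns.main["/user/a"] = "mallory"  # simulate a silent memory corruption
    try:
        ns.lookup("/user/a")
    except StateMismatch as err:
        print("detected:", err)
```

Because the filter is approximate, a corrupted value can occasionally pass the check (a false positive), which is the price of its small footprint. The sketch covers detection only; recovery, which the abstract attributes to micro-recovery, is not modeled here.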