Generating realistic datasets for deduplication analysis

  • Authors (affiliations):
  • Vasily Tarasov (Stony Brook University); Amar Mudrankit (Stony Brook University); Will Buik (Harvey Mudd College); Philip Shilane (EMC Corporation); Geoff Kuenning (Harvey Mudd College); Erez Zadok (Stony Brook University)

  • Venue:
  • USENIX ATC '12: Proceedings of the 2012 USENIX Annual Technical Conference
  • Year:
  • 2012

Abstract

Deduplication is a popular component of modern storage systems, with a wide variety of approaches. Unlike traditional storage systems, deduplication performance depends on data content as well as access patterns and meta-data characteristics. Most datasets that have been used to evaluate deduplication systems are either unrepresentative or unavailable due to privacy issues, preventing easy comparison of competing algorithms. Understanding how both content and meta-data evolve is critical to the realistic evaluation of deduplication systems. We developed a generic model of file system changes based on properties measured on terabytes of real, diverse storage systems. Our model plugs into a generic framework for emulating file system changes. Building on observations from specific environments, the model can generate an initial file system followed by ongoing modifications that emulate the distribution of duplicates and file sizes, realistic changes to existing files, and file system growth. In our experiments, we were able to generate a 4TB dataset within 13 hours on a machine with a single disk drive. The relative error of emulated parameters depends on the model size but remains within 15% of real-world observations.
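
The paper describes the model and generation framework in detail; as a rough illustration only, the sketch below shows one hypothetical way to synthesize files whose chunk-level duplicate ratio tracks a target value, by mixing fresh chunks with reuses from a shared pool. All names and parameters here (CHUNK_SIZE, TARGET_DEDUP_RATIO, write_synthetic_file, and so on) are assumptions for illustration and do not correspond to the authors' actual tool or its parameters.

    # Hypothetical sketch: generate synthetic files whose chunk-level duplicate
    # ratio approximates a target value. Not the authors' implementation; all
    # names and parameters are illustrative assumptions.
    import os
    import random

    CHUNK_SIZE = 8 * 1024          # assumed fixed chunk size (8 KiB)
    TARGET_DEDUP_RATIO = 0.4       # assumed fraction of chunks that are duplicates

    def make_chunk_pool(num_unique, seed=0):
        """Create an initial pool of unique chunks that later files can reuse."""
        rng = random.Random(seed)
        return [rng.randbytes(CHUNK_SIZE) for _ in range(num_unique)]

    def write_synthetic_file(path, num_chunks, pool, rng):
        """Write one file, mixing fresh (unique) chunks with duplicates from the pool."""
        with open(path, "wb") as f:
            for _ in range(num_chunks):
                if pool and rng.random() < TARGET_DEDUP_RATIO:
                    f.write(rng.choice(pool))      # reuse an existing chunk (duplicate)
                else:
                    chunk = rng.randbytes(CHUNK_SIZE)
                    pool.append(chunk)             # remember it so later files can duplicate it
                    f.write(chunk)

    if __name__ == "__main__":
        rng = random.Random(42)
        pool = make_chunk_pool(num_unique=64, seed=1)
        os.makedirs("synthetic_fs", exist_ok=True)
        for i in range(10):
            write_synthetic_file(f"synthetic_fs/file_{i:03d}.bin",
                                 num_chunks=rng.randint(8, 64), pool=pool, rng=rng)

A real generator in the spirit of the paper would additionally draw file sizes, duplicate frequencies, and per-snapshot modifications from distributions measured on real storage systems, rather than from the uniform choices used in this toy sketch.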