Estimating duplication by content-based sampling

  • Authors: Fei Xie, Michael Condict, Sandip Shete

  • Affiliation: Advanced Technology Group, NetApp Inc. (all authors)

  • Venue: USENIX ATC'13 Proceedings of the 2013 USENIX conference on Annual Technical Conference
  • Year: 2013

Abstract

We define a new technique for accurately estimating the amount of duplication in a storage volume from a small sample and we analyze its performance and accuracy. The estimate is useful for determining whether it is worthwhile to incur the overhead of deduplication. The technique works by scanning the fingerprints of every block in the volume, but only including in the sample a single copy of each fingerprint that passes a filter. The selectivity of the filter is repeatedly increased while reading the fingerprints, to produce the target sample size. We show that the required sample size for a reasonable accuracy is small and independent of the size of the volume. In addition, we define and analyze an on-line technique that, once an initial scan of all fingerprints has been performed, efficiently maintains an up-to-date estimate of the duplication as the file system is modified. Experiments with various real data sets show that the accuracy is as predicted by theory. We also prototyped the proposed technique in an enterprise storage system and measured the performance overhead using the IOzone micro-benchmark.
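
The filtering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a filter of the form "the low `level` bits of a 64-bit hash of the fingerprint are zero", doubles the selectivity whenever the sample exceeds the target size, and scales the sample size by `2**level` to estimate the number of distinct blocks. The names `estimate_duplication`, `target_size`, and `_hash64` are hypothetical, and the scaling estimator is an assumption that the abstract does not spell out.

```python
import hashlib


def _hash64(fingerprint: bytes) -> int:
    """Map a fingerprint to a uniform 64-bit integer (hypothetical helper)."""
    return int.from_bytes(hashlib.sha256(fingerprint).digest()[:8], "big")


def estimate_duplication(fingerprints, target_size=1024):
    """Scan every fingerprint once and estimate the duplicate fraction.

    Returns (estimated_distinct, total_blocks, estimated_duplicate_fraction).
    Assumption: the filter keeps hashes whose low `level` bits are all zero,
    so each level halves the expected number of fingerprints that pass.
    """
    level = 0       # current filter selectivity: pass rate is 1 / 2**level
    sample = set()  # one copy of each distinct hash that passes the filter
    total = 0       # number of blocks scanned

    for fp in fingerprints:
        total += 1
        h = _hash64(fp)
        if h & ((1 << level) - 1):
            continue  # rejected by the current filter
        sample.add(h)
        # Tighten the filter until the sample fits the target size again.
        while len(sample) > target_size:
            level += 1
            mask = (1 << level) - 1
            sample = {x for x in sample if not (x & mask)}

    if total == 0:
        return 0, 0, 0.0

    # Each surviving hash stands for about 2**level distinct fingerprints.
    est_distinct = min(len(sample) << level, total)
    return est_distinct, total, 1.0 - est_distinct / total


if __name__ == "__main__":
    # 10,000 distinct blocks, each stored twice: true duplicate fraction 0.5.
    blocks = [str(i).encode() for i in range(10_000)] * 2
    print(estimate_duplication(blocks, target_size=512))
```

Because the filter depends only on block content, every copy of a block either passes or fails together, so the sample holds at most one entry per distinct block and its memory use is bounded by `target_size` regardless of volume size.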