Multi-level comparison of data deduplication in a backup scenario

  • Authors:
  • Dirk Meister;André Brinkmann

  • Affiliations:
  • Paderborn Center for Parallel Computing;Paderborn Center for Parallel Computing

  • Venue:
  • SYSTOR '09 Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference
  • Year:
  • 2009

Abstract

Data deduplication systems detect redundancies between data blocks either to reduce storage needs or to reduce network traffic. A class of deduplication systems splits the data stream into data blocks (chunks) and then finds exact duplicates of these blocks. This paper compares the influence of different chunking approaches on multiple levels. On a macroscopic level, we compare the chunking approaches based on real-life user data in a weekly full backup scenario, both at a single point in time and over several weeks. On a microscopic level, we analyze how small changes affect the deduplication ratio for different file types, for both chunking approaches and delta encoding. An intuitive assumption is that small semantic changes to documents cause only small modifications in the binary representation of files, which would imply a high deduplication ratio. We show that this assumption is not valid for many important file types and that application-specific chunking can help to further decrease storage capacity demands.
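The chunk-and-fingerprint idea described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not state chunk sizes, hash functions, or boundary rules, so the function names (`fixed_chunks`, `content_defined_chunks`, `shared_fraction`), the 64-byte chunk size, the 16-byte window, and the byte-sum window hash are all assumptions chosen for brevity. Production systems typically use a stronger rolling hash and enforce minimum and maximum chunk sizes. The sketch compares how much of a second "backup" is already stored after a single byte is inserted at the front of the data.

```python
import hashlib


def fixed_chunks(data: bytes, size: int = 64) -> list:
    """Static chunking: cut every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data: bytes, window: int = 16, mask: int = 0x3F) -> list:
    """Content-defined chunking: cut wherever a hash of the last `window`
    bytes matches a bit pattern.

    Assumption: the window hash is a plain byte sum, which keeps the sketch
    short but is much weaker than the fingerprints used in real systems.
    """
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte that left the window
        # A boundary depends only on the bytes inside the window.
        if i >= window - 1 and (rolling & mask) == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk without a boundary
    return chunks


def fingerprint(chunk: bytes) -> str:
    """Identify a chunk by its SHA-1 digest (exact-duplicate detection)."""
    return hashlib.sha1(chunk).hexdigest()


def shared_fraction(old_chunks: list, new_chunks: list) -> float:
    """Fraction of the new backup's bytes whose chunks are already stored."""
    known = {fingerprint(c) for c in old_chunks}
    total = sum(len(c) for c in new_chunks)
    duplicate = sum(len(c) for c in new_chunks if fingerprint(c) in known)
    return duplicate / total if total else 0.0


if __name__ == "__main__":
    # Deterministic pseudo-random test data (4096 bytes), purely illustrative.
    original = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(128))
    edited = b"X" + original  # a one-byte insertion at the front

    for name, chunker in (("fixed-size", fixed_chunks),
                          ("content-defined", content_defined_chunks)):
        fraction = shared_fraction(chunker(original), chunker(edited))
        print(f"{name}: {fraction:.1%} of the edited data is already stored")
```

With fixed-size chunking, the inserted byte shifts every later chunk boundary, so almost no chunk of the edited data matches a stored one. With content-defined chunking, boundaries move together with the content, so chunks after the first boundary are found again. This is the kind of sensitivity to small changes that the microscopic analysis in the abstract is concerned with.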