Anchor-driven subchunk deduplication

  • Authors:
  • Bartłomiej Romański, Łukasz Heldt, Wojciech Kilian, Krzysztof Lichota, Cezary Dubnicki

  • Affiliation:
  • 9LivesData, LLC, Warsaw, Poland (all authors)

  • Venue:
  • Proceedings of the 4th Annual International Conference on Systems and Storage
  • Year:
  • 2011

Abstract

Data deduplication, usually implemented with content-defined chunking (CDC), is today one of the key features of advanced storage systems providing space for backup applications. Although simple and effective, CDC generates chunks with sizes clustered around an expected chunk size, which is globally fixed for a given storage system and applies to all backups. This creates an opportunity for improvement, as the optimal chunk size for deduplication varies not only among backup datasets, but also within one dataset: long stretches of unchanged data favor larger chunks, whereas regions of change prefer smaller ones. In this work, we present a new algorithm which deduplicates with big chunks as well as with their subchunks, using a deduplication context containing subchunk-to-container-chunk mappings. When writing data, this context is constructed on the fly using so-called anchor sequences, defined as short sequences of adjacent chunks in a backup stream (a stream of data produced by a backup application containing the backed-up files). For each anchor sequence, we generate an anchor -- a special block with a set of mappings covering a contiguous region of the backup stream positioned ahead of this anchor sequence. If anchor sequences have not changed between backups, the mappings created during the previous backup are prefetched and added to the deduplication context. The context is of limited size and fits in main memory, unlike solutions which require keeping all subchunk mappings for the entire backup stream. At the same time, the context provides most of the mappings needed for subchunk deduplication. Compared to CDC, the new algorithm improves the deduplication ratio by up to 25%, achieved with an almost 3 times larger average block size, as verified by simulations driven by real backup traces.
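
To make the mechanism concrete, below is a minimal Python sketch of the idea as described in the abstract, under heavy simplifications: fixed-size subchunks instead of content-defined ones, an anchor region of a few chunks, and illustrative parameter values (ANCHOR_SEQ_LEN, REGION_LEN, SUBCHUNK_SIZE, and CONTEXT_CAPACITY are our assumptions, not the paper's). This is not the authors' implementation; it only illustrates how prefetching mappings at unchanged anchor sequences lets subchunks of a modified chunk deduplicate.

```python
import hashlib
from collections import OrderedDict

# Illustrative parameters only -- the abstract does not specify values.
ANCHOR_SEQ_LEN = 2       # adjacent chunks forming one anchor sequence
REGION_LEN = 4           # chunks ahead of the sequence covered by one anchor
SUBCHUNK_SIZE = 1024     # fixed split; the paper derives subchunks with CDC
CONTEXT_CAPACITY = 4096  # bounded so the context fits in main memory


def digest(data):
    return hashlib.sha1(data).hexdigest()


def subchunks(chunk):
    """Yield (hash, offset) for each subchunk of a container chunk."""
    for off in range(0, len(chunk), SUBCHUNK_SIZE):
        yield digest(chunk[off:off + SUBCHUNK_SIZE]), off


class DedupContext:
    """Bounded in-memory map: subchunk hash -> (container chunk hash, offset)."""

    def __init__(self, capacity=CONTEXT_CAPACITY):
        self.capacity = capacity
        self.map = OrderedDict()

    def add(self, mappings):
        for key, loc in mappings:
            self.map[key] = loc
            self.map.move_to_end(key)
            if len(self.map) > self.capacity:
                self.map.popitem(last=False)  # evict the oldest mapping

    def lookup(self, key):
        return self.map.get(key)


def write_backup(chunks, anchors, context, store):
    """Write one backup stream (a list of container chunks).

    `anchors` and `store` persist across backups: `anchors` maps an
    anchor-sequence hash to the subchunk mappings of the region ahead of
    that sequence, `store` holds hashes of container chunks already written.
    Returns how many subchunks deduplicated through the context."""
    new_anchors = {}
    dedup_hits = 0

    for i, chunk in enumerate(chunks):
        if i >= ANCHOR_SEQ_LEN - 1:
            # Hash the last ANCHOR_SEQ_LEN chunk hashes into an anchor sequence.
            window = chunks[i - ANCHOR_SEQ_LEN + 1:i + 1]
            seq = digest(b"".join(digest(c).encode() for c in window))
            if seq in anchors:
                # Unchanged sequence: prefetch the previous backup's
                # mappings for the region ahead into the context.
                context.add(anchors[seq])
            # Record this backup's anchor for the contiguous region of
            # REGION_LEN chunks positioned ahead of the sequence.
            region = chunks[i + 1:i + 1 + REGION_LEN]
            new_anchors[seq] = [(sh, (digest(c), off))
                                for c in region for sh, off in subchunks(c)]

        # Whole-chunk dedup first, then subchunk dedup against the context.
        ch = digest(chunk)
        if ch in store:
            continue
        for sh, _off in subchunks(chunk):
            if context.lookup(sh) is not None:
                dedup_hits += 1  # stored as a reference instead of raw data
        store.add(ch)

    anchors.update(new_anchors)
    return dedup_hits


if __name__ == "__main__":
    anchors, context, store = {}, DedupContext(), set()
    backup1 = [bytes([i]) * 8192 for i in range(16)]
    write_backup(backup1, anchors, context, store)

    backup2 = list(backup1)
    backup2[10] = backup1[10][:4096] + b"\xff" * 4096  # half of one chunk changed
    # The unchanged half of the modified chunk dedups via mappings
    # prefetched at the (unchanged) anchor sequence just before it.
    print(write_backup(backup2, anchors, context, store))  # -> 4
```

Under these assumptions the bounded OrderedDict plays the role of the limited-size context from the abstract: only mappings for regions the stream is about to revisit are ever resident in memory, yet the unchanged half of the modified chunk still deduplicates at subchunk granularity.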