Block locality caching for data deduplication

  • Authors:
  • Dirk Meister; Jürgen Kaiser; André Brinkmann

  • Affiliations:
  • University of Mainz; University of Mainz; University of Mainz

  • Venue:
  • Proceedings of the 6th International Systems and Storage Conference
  • Year:
  • 2013

Abstract

Data deduplication systems discover and remove redundancies between data blocks by splitting the data stream into chunks and comparing a hash of each chunk against all previously stored hashes. Storing the corresponding chunk index on hard disks immediately limits the achievable throughput, as these devices are unable to support the high number of random IOs induced by this index. Several approaches have been proposed to overcome this chunk lookup disk bottleneck. Most of them try to capture the locality information of a backup run and use it in the next backup run to predict future chunk requests. However, this locality is often captured only by a surrogate, e.g., the order of the chunks in containers [37]. Furthermore, some approaches slowly degrade when the system operates over months and years, because the captured locality information becomes outdated. We propose a novel approach, called Block Locality Cache (BLC), which captures the locality of the previous backup run significantly better than existing approaches and always uses up-to-date locality information, making it less prone to aging. We evaluate the approach using a trace-based simulation of multiple real-world backup datasets. The simulation compares the Block Locality Cache with the approach of Zhu et al. [37] and provides a detailed analysis of the cache behavior and IO patterns. Furthermore, a prototype implementation is used to validate the simulation.
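
The chunk lookup disk bottleneck and the container-order locality surrogate of Zhu et al. [37] can be illustrated with a small sketch. The Python below is our own minimal illustration, not code from the paper: the class and method names (DedupStore, is_duplicate, etc.) and all parameters are hypothetical, and the container prefetching models the Zhu-style surrogate that the paper compares against, not the Block Locality Cache itself.

```python
import hashlib
from collections import OrderedDict

class DedupStore:
    """Toy chunk store with an "on-disk" fingerprint index and a
    container-based locality cache. All names are illustrative."""

    CONTAINER_SIZE = 4    # fingerprints per container (tiny, for the demo)
    CACHE_CAPACITY = 8    # number of containers kept in the LRU cache

    def __init__(self):
        self.disk_index = {}        # fingerprint -> container id ("on disk")
        self.containers = []        # container id -> list of fingerprints
        self.open_container = []    # container currently being filled
        self.cache = OrderedDict()  # container id -> fingerprint set (LRU)
        self.disk_lookups = 0       # counts the expensive random index IOs

    def _container_fps(self, cid):
        # The referenced container may still be the open, unflushed one.
        return self.containers[cid] if cid < len(self.containers) \
            else self.open_container

    def _prefetch(self, cid):
        # Locality surrogate: chunks stored together in a container are
        # assumed to be requested together again, so cache them all at once.
        self.cache[cid] = set(self._container_fps(cid))
        self.cache.move_to_end(cid)
        while len(self.cache) > self.CACHE_CAPACITY:
            self.cache.popitem(last=False)  # evict least recently used

    def is_duplicate(self, fp):
        # 1. Cheap path: fingerprint already prefetched into the cache?
        #    (A real system would keep one cached fingerprint -> container
        #    map instead of scanning all cached containers.)
        for cid, fps in self.cache.items():
            if fp in fps:
                self.cache.move_to_end(cid)
                return True
        # 2. Expensive path: one random IO against the on-disk index.
        self.disk_lookups += 1
        cid = self.disk_index.get(fp)
        if cid is None:
            return False
        self._prefetch(cid)
        return True

    def store(self, chunk):
        fp = hashlib.sha1(chunk).digest()
        if self.is_duplicate(fp):
            return  # redundant chunk; only a reference would be stored
        self.disk_index[fp] = len(self.containers)  # id the open container gets
        self.open_container.append(fp)
        if len(self.open_container) == self.CONTAINER_SIZE:
            self.containers.append(self.open_container)
            self.open_container = []

store = DedupStore()
for _ in range(10):                  # ten identical "backup runs"
    for i in range(10):              # of ten distinct chunks each
        store.store(b"chunk-%d" % i)
print("stores: 100, disk index lookups:", store.disk_lookups)
```

With these toy parameters, the first run pays one random index IO per unique chunk, the second run pays only one IO per container (each hit prefetches that container's neighbors), and later runs are served entirely from the cache. This shows why the surrogate works while data placement matches the stream order, and also why it ages: once chunks of a backup are scattered across many old containers, each prefetch brings in fewer useful fingerprints. Per the abstract, the proposed BLC instead derives its locality hints from the previous backup run itself, so the prefetched information is always up to date.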