Space savings and design considerations in variable length deduplication

  • Authors: Giridhar Appaji Nag Yasa; P. C. Nagesh
  • Affiliations: NetApp Inc.; NetApp Inc.
  • Venue: ACM SIGOPS Operating Systems Review
  • Year: 2012

Abstract

The explosive growth and duplication of data in enterprises have led to the deployment of a variety of deduplication technologies. However, not all deduplication technologies serve the needs of every workload. Most prior research in deduplication concentrates on fixed block size (or variable block size at a fixed block boundary) deduplication, which provides sub-optimal space efficiency in workloads where the duplicate data is not block aligned. Workloads also differ in the nature of their operations and their priorities, which affects the choice of the right flavor of deduplication. Object workloads, for instance, hold multiple versions of archived documents that have a high degree of duplicate data. They are also write-once, read-many in nature and follow a whole-object GET, PUT and DELETE model, and would be better served by a deduplication strategy that handles non-block-aligned changes to data. In this paper, we describe and evaluate a hybrid of variable length and block based deduplication that is hierarchical in nature. We are motivated by the following insights from real-world data: (a) object workload applications do not modify data in place, and hence new versions of objects are written again as a whole; (b) a significant amount of data among different versions of the same object is shareable, but the changes are usually not block aligned. While the second point is the basis for the variable length technique, both insights motivate our hierarchical deduplication strategy. We show through experiments with production data sets from enterprise environments that this provides up to twice the space savings compared to fixed block deduplication.
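
To make the point about non-block-aligned changes concrete, the following is a minimal Python sketch that contrasts fixed block chunking with a generic content-defined (variable length) chunker on two versions of an object that differ by a small mid-object insertion. It is not the hierarchical scheme evaluated in the paper, whose details the abstract does not give. The function names, block size, window size, mask and chunk size limits are all illustrative assumptions.

```python
import hashlib
import random

BLOCK = 64    # assumed fixed block size in bytes (illustrative only)
WINDOW = 16   # assumed number of trailing bytes that decide a boundary
MASK = 0x3F   # assumed boundary mask: roughly 1 in 64 positions qualifies


def fixed_chunks(data, size=BLOCK):
    """Split data at fixed offsets, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data, window=WINDOW, mask=MASK, min_size=16, max_size=256):
    """Split data where a hash of the last `window` bytes matches `mask`.

    Boundaries depend on content rather than offset, so they can realign
    after an insertion. For brevity the window hash is recomputed at each
    position; a production chunker would use a true rolling hash.
    """
    if min_size < window:
        raise ValueError("min_size must be at least window")
    chunks = []
    start = 0
    for i in range(len(data)):
        length = i - start + 1
        if length < min_size:
            continue
        digest = hashlib.blake2b(data[i + 1 - window:i + 1], digest_size=4).digest()
        h = int.from_bytes(digest, "big")
        if (h & mask) == 0 or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def stored_bytes(versions, chunker):
    """Bytes kept after deduplicating identical chunks across all versions."""
    seen = set()
    total = 0
    for version in versions:
        for chunk in chunker(version):
            key = hashlib.sha256(chunk).digest()
            if key not in seen:
                seen.add(key)
                total += len(chunk)
    return total


# Two versions of a hypothetical object: v2 is v1 with 5 bytes inserted at
# offset 100, which is not a multiple of the block size.
rng = random.Random(0)
v1 = bytes(rng.getrandbits(8) for _ in range(4096))
v2 = v1[:100] + b"EDIT!" + v1[100:]

fixed_total = stored_bytes([v1, v2], fixed_chunks)
cdc_total = stored_bytes([v1, v2], cdc_chunks)
```

With fixed blocks, every block after the insertion point is shifted and no longer matches its counterpart in the first version, so nearly all of the second version is stored again. The content-defined chunker typically realigns within a few chunks of the edit, so only the chunks around the insertion are new.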