Sequence of Hashes Compression in Data De-duplication

  • Authors:
  • Subashini Balachandran; Cornel Constantinescu

  • Venue:
  • DCC '08 Proceedings of the Data Compression Conference
  • Year:
  • 2008

Abstract

Data de-duplication is a simple compression method that has become very popular in storage archival and backup. It has the advantage of direct, random access to any piece ("chunk") of a file in one table lookup; that is not the case with differential file compression, the other common storage archival method. The compression efficiency (chunk matching) of de-duplication improves with smaller chunk sizes; however, the sequence of hashes replacing the de-duplicated object (file) then grows significantly. We propose a simple scheme to shrink the list of hashes generated during de-duplication of an object. The compressed list is orders of magnitude smaller than what a customary compression algorithm (gzip) achieves, and this has a significant impact on overall de-duplication efficiency.
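To make the setting concrete, here is a minimal sketch of chunk-based de-duplication as the abstract describes it: an object is split into chunks, each unique chunk is stored once under its hash, and the object itself is replaced by a sequence of hashes, so any chunk is recovered in one table lookup. This uses fixed-size chunking and SHA-256 for illustration only; the paper's actual chunking method and its scheme for compressing the hash sequence are not detailed here.

```python
import hashlib

def deduplicate(data: bytes, chunk_size: int = 4):
    """Split data into fixed-size chunks; store each unique chunk once,
    keyed by its hash, and represent the object as a sequence of hashes."""
    store = {}           # hash -> chunk bytes (the shared chunk store)
    hash_sequence = []   # per-object list of chunk hashes
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        h = hashlib.sha256(chunk).hexdigest()
        store.setdefault(h, chunk)   # duplicate chunks are not stored again
        hash_sequence.append(h)
    return store, hash_sequence

def reconstruct(store, hash_sequence):
    """Random access: any chunk of the object is one table lookup away."""
    return b"".join(store[h] for h in hash_sequence)
```

Note the trade-off the abstract points out: smaller chunks match more duplicates (a smaller store), but the per-object hash sequence grows proportionally, which is what motivates compressing the sequence of hashes itself.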