Demystifying data deduplication

  • Authors:
  • Nagapramod Mandagere;Pin Zhou;Mark A Smith;Sandeep Uttamchandani

  • Affiliations:
  • University of Minnesota;IBM Almaden Research Center;IBM Almaden Research Center;IBM Almaden Research Center

  • Venue:
  • Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion
  • Year:
  • 2008

Abstract

The effectiveness and tradeoffs of deduplication technologies are not well understood: vendors tout deduplication as a "silver bullet" that can help any enterprise optimize its deployed storage capacity. This paper aims to provide a comprehensive taxonomy and an experimental evaluation using real-world data. While the day-to-day rate of change of data has the greatest influence on duplication in backup data, we investigate the duplication inherent in the data itself, independent of the rate of change, the backup schedule, or the backup algorithm used. Our experimental results show that, across deduplication techniques, space savings vary by about 30%, CPU usage differs by a factor of almost 6, and the time to reconstruct a deduplicated file can vary by a factor of more than 15.
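
To make the notion of "space savings" concrete, the following is a minimal sketch of how it could be measured for one common deduplication technique, fixed-size chunk hashing. The abstract does not say which techniques the paper evaluates, so the choice of technique, the function name `dedup_savings`, the 4096-byte default chunk size, and the use of SHA-256 are all assumptions made for illustration, not the authors' method.

```python
import hashlib


def dedup_savings(data: bytes, chunk_size: int = 4096) -> float:
    """Return the fraction of bytes saved by storing each distinct chunk once.

    Hypothetical helper: fixed-size chunking and SHA-256 fingerprints are
    illustrative assumptions, not the techniques evaluated in the paper.
    """
    if not data:
        return 0.0

    # Map each chunk fingerprint to the length of the chunk it identifies.
    unique_chunks = {}
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha256(chunk).digest()
        unique_chunks.setdefault(fingerprint, len(chunk))

    # Bytes that must actually be stored after deduplication.
    stored_bytes = sum(unique_chunks.values())
    return 1.0 - stored_bytes / len(data)
```

For example, an input made of three identical 4096-byte chunks followed by one different chunk contains two distinct chunks out of four, so half of the bytes are saved. A real system would also account for the index and metadata overhead, which this sketch ignores.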