Alternatives for detecting redundancy in storage systems data

  • Authors: Calicrates Policroniades; Ian Pratt
  • Affiliations: Computer Laboratory, Cambridge University, Cambridge, UK (both authors)
  • Venue: ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
  • Year: 2004

Abstract

Storage systems frequently maintain identical copies of data. Identifying such data can assist in the design of solutions in which data storage, transmission, and management are optimised. In this paper we evaluate three methods used to discover identical portions of data: whole-file content hashing, fixed-size blocking, and a chunking strategy that uses Rabin fingerprints to delimit content-defined data chunks. We assess how effective each of these strategies is in finding identical sections of data. In our experiments, we analysed diverse data sets from several types of storage systems, including a mirrored section of sunsite.org.uk, different data profiles in the file system infrastructure of the Cambridge University Computer Laboratory, source code distribution trees, compressed data, and packed files. We report our experimental results and present a comparative analysis of these techniques. This study also shows how levels of similarity differ between data sets and file types. Finally, we discuss the advantages and disadvantages of applying these methods in the light of our experimental results.
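
The three detection methods named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the SHA-1 digest, the 4096-byte block size, the 16-byte window, and the boundary mask are all assumptions chosen for illustration, and a simple polynomial rolling hash stands in for true Rabin fingerprints (which use polynomial arithmetic over GF(2)).

```python
import hashlib

BASE = 257               # assumed multiplier for the rolling hash
MOD = (1 << 31) - 1      # assumed prime modulus (not a Rabin polynomial)


def whole_file_hash(data):
    """Method 1: a single digest for the entire file content."""
    return hashlib.sha1(data).hexdigest()


def fixed_size_blocks(data, block_size=4096):
    """Method 2: split at fixed offsets; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def content_defined_chunks(data, window=16, mask=0x3F):
    """Method 3: cut wherever a rolling hash of the last `window` bytes
    matches a bit pattern, so boundaries depend on content, not offsets."""
    chunks = []
    start = 0
    h = 0
    top = pow(BASE, window - 1, MOD)  # weight of the byte leaving the window
    for i, byte in enumerate(data):
        if i >= window:
            # Remove the contribution of the byte that slid out of the window.
            h = (h - data[i - window] * top) % MOD
        h = (h * BASE + byte) % MOD
        # Only test for a boundary once the window is full.
        if i + 1 >= window and (h & mask) == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


def digests(pieces):
    """Set of SHA-1 digests, used to compare two lists of blocks or chunks."""
    return {hashlib.sha1(p).hexdigest() for p in pieces}
```

Comparing `digests(content_defined_chunks(a))` with `digests(content_defined_chunks(b))` gives the chunks two inputs share. In this sketch, inserting one byte at the start of a file changes its whole-file hash and shifts every fixed-size block, while content-defined boundaries realign after the first chunk.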