SBBS: A sliding blocking algorithm with backtracking sub-blocks for duplicate data detection

  • Authors:
  • Guiping Wang; Shuyu Chen; Mingwei Lin; Xiaowei Liu

  • Affiliations:
  • College of Computer Science, Chongqing University, Chongqing 400044, China; College of Software Engineering, Chongqing University, Chongqing 400044, China; College of Computer Science, Chongqing University, Chongqing 400044, China; HuaWei Research Institute, Chengdu, Sichuan 610041, China

  • Venue:
  • Expert Systems with Applications: An International Journal
  • Year:
  • 2014

Quantified Score

Hi-index 12.05

Abstract

With the explosive growth of data, storage systems face heavy storage pressure from the large volume of redundant data created by duplicate copies or duplicate regions of files. Data deduplication is a storage-optimization technique that reduces the data footprint by eliminating redundant copies and storing only unique data. It rests on duplicate data detection techniques, which divide files into a number of parts, compare corresponding parts between files via hashing, and identify redundant data. This paper proposes SBBS, an efficient sliding blocking algorithm with backtracking sub-blocks, for duplicate data detection. SBBS improves the duplicate detection precision of the traditional sliding blocking (SB) algorithm by backtracking the left/right 1/4 and 1/2 sub-blocks in matching-failed segments. Experimental results show that SBBS improves duplicate detection precision on average by 6.5% compared with the traditional SB algorithm and by 16.5% compared with the content-defined chunking (CDC) algorithm, and that it incurs little extra storage overhead when SBBS divides files into equal chunks of 8 KB.
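
The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the algorithm's details, so the function names, the use of SHA-1 on every window (rather than a cheap rolling checksum followed by a strong hash), and the exact backtracking rule (test the left end of a matching-failed segment against left 1/2 and 1/4 sub-blocks, and the right end against right 1/2 and 1/4 sub-blocks) are all assumptions made for illustration.

```python
import hashlib


def digest(data: bytes) -> bytes:
    """Strong hash of a chunk (SHA-1 is an assumed choice)."""
    return hashlib.sha1(data).digest()


def index_blocks(old: bytes, block_size: int):
    """Hash the fixed-size blocks of the old file plus their 1/2 and 1/4
    left/right sub-blocks (hypothetical index layout)."""
    full, left_subs, right_subs = set(), set(), set()
    for start in range(0, len(old) - block_size + 1, block_size):
        block = old[start:start + block_size]
        full.add(digest(block))
        for frac in (2, 4):
            n = block_size // frac
            left_subs.add(digest(block[:n]))
            right_subs.add(digest(block[block_size - n:]))
    return full, left_subs, right_subs


def sliding_blocking(new: bytes, full: set, block_size: int):
    """Traditional SB: slide a window byte by byte; on a hit, skip a block.
    Returns matched byte count and the matching-failed segments."""
    pos = seg_start = matched = 0
    failed = []
    while pos + block_size <= len(new):
        if digest(new[pos:pos + block_size]) in full:
            if pos > seg_start:
                failed.append((seg_start, pos))
            matched += block_size
            pos += block_size
            seg_start = pos
        else:
            pos += 1
    if seg_start < len(new):
        failed.append((seg_start, len(new)))
    return matched, failed


def backtrack_sub_blocks(new, failed, left_subs, right_subs, block_size):
    """Assumed backtracking step: in each failed segment, try the 1/2 then
    1/4 sub-block at the left end and at the right end."""
    sizes = (block_size // 2, block_size // 4)
    recovered = 0
    for start, end in failed:
        seg = new[start:end]
        left = next((n for n in sizes
                     if n <= len(seg) and digest(seg[:n]) in left_subs), 0)
        right = next((n for n in sizes
                      if n <= len(seg) - left
                      and digest(seg[len(seg) - n:]) in right_subs), 0)
        recovered += left + right
    return recovered


def detect(old: bytes, new: bytes, block_size: int):
    """Return duplicate bytes found by plain SB and by SB plus backtracking."""
    full, left_subs, right_subs = index_blocks(old, block_size)
    matched, failed = sliding_blocking(new, full, block_size)
    extra = backtrack_sub_blocks(new, failed, left_subs, right_subs, block_size)
    return matched, matched + extra


if __name__ == "__main__":
    old = b"abcdefgh" + b"ijklmnop" + b"qrstuvwx"
    new = b"abcdefgh" + b"ijklm" + b"XY" + b"nop" + b"qrstuvwx"
    print(detect(old, new, block_size=8))
```

On this toy input, with two bytes inserted into the middle block, plain sliding blocking detects 16 duplicate bytes, and the backtracking step recovers 6 more (the left half `ijkl` and the right quarter `op` of the damaged block), for 22 in total. The extra sub-block hashes in the index stand in for the additional storage overhead the abstract mentions.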