SiLo: a similarity-locality based near-exact deduplication scheme with low RAM overhead and high throughput

  • Authors:
  • Wen Xia; Hong Jiang; Dan Feng; Yu Hua

  • Affiliations:
  • School of Computer, Huazhong University of Science and Technology, Wuhan, China and Wuhan National Lab for Optoelectronics, Wuhan, China;Dept. of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE;School of Computer, Huazhong University of Science and Technology, Wuhan, China and Wuhan National Lab for Optoelectronics, Wuhan, China;School of Computer, Huazhong University of Science and Technology, Wuhan, China and Wuhan National Lab for Optoelectronics, Wuhan, China and Dept. of Computer Science and Engineering, University o ...

  • Venue:
  • USENIX ATC '11: Proceedings of the 2011 USENIX Annual Technical Conference
  • Year:
  • 2011

Abstract

Data deduplication is becoming increasingly popular in storage systems as a space-efficient approach to data backup and archiving. Most existing state-of-the-art deduplication methods are either locality based or similarity based, which, according to our analysis, do not work adequately in many situations. While the former produces poor deduplication throughput when there is little or no locality in datasets, the latter can fail to identify and thus remove significant amounts of redundant data when there is a lack of similarity among files. In this paper, we present SiLo, a near-exact deduplication system that effectively and complementarily exploits similarity and locality to achieve high duplicate elimination and throughput at extremely low RAM overheads. The main idea behind SiLo is to expose and exploit more similarity by grouping strongly correlated small files into a segment and segmenting large files, and to leverage locality in the backup stream by grouping contiguous segments into blocks to capture similar and duplicate data missed by the probabilistic similarity detection. By judiciously enhancing similarity through the exploitation of locality and vice versa, the SiLo approach is able to significantly reduce RAM usage for index-lookup and maintain a very high deduplication throughput. Our experimental evaluation of SiLo based on real-world datasets shows that the SiLo system consistently and significantly outperforms two existing state-of-the-art systems, one based on similarity and the other based on locality, under various workload conditions.
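
The interplay of similarity and locality described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the data structures, so the class name `SiLoSketch`, the use of SHA-1 fingerprints, the choice of the minimum fingerprint as a segment's representative, the `segments_per_block` parameter, and the unbounded in-memory cache are all assumptions made for illustration. Only one representative fingerprint per segment is indexed in RAM, and a similarity hit loads the whole enclosing block so that neighbouring segments can be deduplicated through locality.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    """Chunk fingerprint; SHA-1 is an assumption of this sketch."""
    return hashlib.sha1(chunk).hexdigest()


class SiLoSketch:
    """Hypothetical toy model of similarity plus locality deduplication."""

    def __init__(self, segments_per_block: int = 2):
        self.segments_per_block = segments_per_block
        self.similarity_index = {}  # representative fingerprint -> block id (RAM)
        self.blocks = {}            # block id -> fingerprint set (stand-in for disk)
        self.cache = set()          # fingerprints of blocks loaded so far
        self.pending = []           # (representative, fingerprints) of the open block

    def start_backup(self) -> None:
        """Begin a new backup session with a cold cache."""
        self.cache = set()

    def backup_segment(self, chunks) -> int:
        """Deduplicate one non-empty segment; return the number of new chunks."""
        fps = [fingerprint(c) for c in chunks]
        rep = min(fps)  # assumed representative: the smallest fingerprint
        block_id = self.similarity_index.get(rep)
        if block_id is not None:
            # Similarity hit: load the whole block, not just the one segment,
            # so that contiguous segments benefit from locality.
            self.cache |= self.blocks[block_id]
        new = 0
        for fp in fps:
            if fp not in self.cache:
                new += 1
                self.cache.add(fp)
        self.pending.append((rep, fps))
        if len(self.pending) == self.segments_per_block:
            self.seal_block()
        return new

    def seal_block(self) -> None:
        """Group the pending contiguous segments into one block and index it."""
        if not self.pending:
            return
        block_id = len(self.blocks)
        self.blocks[block_id] = {fp for _, fps in self.pending for fp in fps}
        for rep, _ in self.pending:
            self.similarity_index[rep] = block_id
        self.pending = []
```

In this sketch, if a second backup repeats the first segment unchanged, its representative fingerprint hits the index and the enclosing block is loaded. A following segment whose representative chunk was modified then misses the similarity index, yet its unchanged chunks are still found in the loaded block. A real system would bound the cache and evict blocks, which is one reason such a scheme is near-exact rather than exact.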