SiLo: a similarity-locality based near-exact deduplication scheme with low RAM overhead and high throughput

Authors:
Wen Xia;Hong Jiang;Dan Feng;Yu Hua
Affiliations:
School of Computer, Huazhong University of Science and Technology, Wuhan, China and Wuhan National Lab for Optoelectronics, Wuhan, China;Dept. of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE;School of Computer, Huazhong University of Science and Technology, Wuhan, China and Wuhan National Lab for Optoelectronics, Wuhan, China;School of Computer, Huazhong University of Science and Technology, Wuhan, China and Wuhan National Lab for Optoelectronics, Wuhan, China and Dept. of Computer Science and Engineering, University o ...
Venue:
USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
Year:
2011

Citing 21
Cited 10

A low-bandwidth network file system

SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
Venti: A New Approach to Archival Storage

FAST '02 Proceedings of the Conference on File and Storage Technologies
On the Resemblance and Containment of Documents

SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
Finding similar files in large document repositories

Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Alternatives for detecting redundancy in storage systems data

ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
Finding similar files in a large file system

WTEC'94 Proceedings of the USENIX Winter 1994 Technical Conference on USENIX Winter 1994 Technical Conference
Content-based document routing and index partitioning for scalable similarity-based searches in a large corpus

Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
A five-year study of file-system metadata

ACM Transactions on Storage (TOS)
Avoiding the disk bottleneck in the data domain deduplication file system

FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
Sparse indexing: large scale, inline deduplication using sampling and locality

FAST '09 Proccedings of the 7th conference on File and storage technologies
Cumulus: filesystem backup to the cloud

FAST '09 Proccedings of the 7th conference on File and storage technologies
The design of a similarity based deduplication system

SYSTOR '09 Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference
The effectiveness of deduplication on virtual machine disk images

SYSTOR '09 Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference
I/O Deduplication: Utilizing content similarity to improve I/O performance

ACM Transactions on Storage (TOS)
A New Buffer Cache Design Exploiting Both Temporal and Content Localities

ICDCS '10 Proceedings of the 2010 IEEE 30th International Conference on Distributed Computing Systems
Bimodal content defined chunking for backup streams

FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
Decentralized deduplication in SAN cluster file systems

USENIX'09 Proceedings of the 2009 conference on USENIX Annual technical conference
ChunkStash: speeding up inline storage deduplication using flash memory

USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
SAM: A Semantic-Aware Multi-tiered Source De-duplication Framework for Cloud Backup

ICPP '10 Proceedings of the 2010 39th International Conference on Parallel Processing
CAFTL: a content-aware flash translation layer enhancing the lifespan of flash memory based solid state drives

FAST'11 Proceedings of the 9th USENIX conference on File and stroage technologies
Leveraging value locality in optimizing NAND flash-based SSDs

FAST'11 Proceedings of the 9th USENIX conference on File and stroage technologies

WAN optimized replication of backup datasets using stream-informed delta compression

FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
Shredder: GPU-accelerated incremental storage and computation

FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
iDedup: latency-aware, inline data deduplication for primary storage

FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
Generating realistic datasets for deduplication analysis

USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
WAN-optimized replication of backup datasets using stream-informed delta compression

ACM Transactions on Storage (TOS)
Space savings and design considerations in variable length deduplication

ACM SIGOPS Operating Systems Review
A scalable inline cluster deduplication framework for big data protection

Proceedings of the 13th International Middleware Conference
Block locality caching for data deduplication

Proceedings of the 6th International Systems and Storage Conference
SAFE: A Source Deduplication Framework for Efficient Cloud Backup Services

Journal of Signal Processing Systems
Read-Performance Optimization for Deduplication-Based Storage Systems in the Cloud

ACM Transactions on Storage (TOS)

Quantified Score

Hi-index	0.00

Visualization

Abstract

Data Deduplication is becoming increasingly popular in storage systems as a space-efficient approach to data backup and archiving. Most existing state-of-the-art deduplication methods are either locality based or similarity based, which, according to our analysis, do not work adequately in many situations. While the former produces poor deduplication throughput when there is little or no locality in datasets, the latter can fail to identify and thus remove significant amounts of redundant data when there is a lack of similarity among files. In this paper, we present SiLo, a near-exact deduplication system that effectively and complementarily exploits similarity and locality to achieve high duplicate elimination and throughput at extremely low RAM overheads. The main idea behind SiLo is to expose and exploit more similarity by grouping strongly correlated small files into a segment and segmenting large files, and to leverage locality in the backup stream by grouping contiguous segments into blocks to capture similar and duplicate data missed by the probabilistic similarity detection. By judiciously enhancing similarity through the exploitation of locality and vice versa, the SiLo approach is able to significantly reduce RAM usage for index-lookup and maintain a very high deduplication throughput. Our experimental evaluation of SiLo based on real-world datasets shows that the SiLo system consistently and significantly outperforms two existing state-of-the-art system, one based on similarity and the other based on locality, under various workload conditions.