Very large block-level data backup systems need scalable data deduplication and garbage collection techniques to make efficient use of storage space while minimizing the performance overhead of doing so. Although the deduplication and garbage collection logic is conceptually straightforward, implementing it efficiently poses a significant technical challenge because only a small portion of the associated data structures can fit in memory. In this paper, we describe the design, implementation, and evaluation of Sungem, a data deduplication and garbage collection engine designed to remove duplicate blocks from incremental data backup streams. Sungem features three novel techniques that maximize deduplication throughput without compromising the deduplication ratio. First, Sungem places related fingerprint sequences, rather than fingerprints from the same backup stream, into the same container to improve fingerprint prefetching efficiency. Second, to make the most of the memory reserved for fingerprints, Sungem varies the sampling rate of each fingerprint sequence according to its stability. Third, Sungem combines reference counts and expiration times in a unique way to arrive at the first known incremental garbage collection algorithm whose bookkeeping overhead is proportional to the size of a disk volume's incremental backup snapshot rather than its full backup snapshot. We evaluated the Sungem prototype on a real-world data backup trace and showed that it sustains more than 200,000 fingerprint lookups per second on average on a standard x86 server, including the cost of garbage collection.
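To make the third technique concrete, the following is a minimal Python sketch of how reference counts and expiration times might be combined so that each backup only touches the chunks named in its incremental snapshot. The abstract does not spell out Sungem's actual algorithm; all names here (ChunkRecord, IncrementalGC, on_backup) are hypothetical, and retention-driven expiration is an assumption.

```python
import time

class ChunkRecord:
    """Hypothetical per-chunk bookkeeping record; the two fields mirror the
    two signals the abstract mentions: a reference count and an expiration
    time."""
    def __init__(self):
        self.refcount = 0      # number of live backups referencing this chunk
        self.expires_at = 0.0  # latest retention deadline among those backups

class IncrementalGC:
    """Sketch of GC bookkeeping whose per-backup work is proportional to the
    incremental snapshot: only chunks added or dropped by this backup are
    touched, never the full set of stored chunks."""

    def __init__(self):
        self.index = {}  # fingerprint -> ChunkRecord

    def on_backup(self, added_fps, removed_fps, retention_secs):
        deadline = time.time() + retention_secs
        for fp in added_fps:        # chunks newly referenced by this backup
            rec = self.index.setdefault(fp, ChunkRecord())
            rec.refcount += 1
            rec.expires_at = max(rec.expires_at, deadline)
        for fp in removed_fps:      # chunks a superseded backup no longer needs
            rec = self.index.get(fp)
            if rec is not None:
                rec.refcount -= 1

    def reclaimable(self):
        now = time.time()
        # A chunk is safe to reclaim once nothing references it, or once every
        # backup that could reference it has passed its retention deadline.
        return [fp for fp, rec in self.index.items()
                if rec.refcount <= 0 or rec.expires_at < now]
```

The point of the sketch is the cost model: on_backup iterates only over added_fps and removed_fps, the deltas carried by an incremental backup, which is what makes the bookkeeping overhead scale with the incremental snapshot rather than the full backup snapshot.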