Very large block-level data backup systems need scalable data deduplication and garbage collection techniques to make efficient use of storage space while minimizing the performance overhead of doing so. Although the deduplication and garbage collection logic is conceptually straightforward, implementing it efficiently poses a significant technical challenge because only a small portion of the associated data structures can fit in memory. In this paper, we describe the design, implementation, and evaluation of Sungem, a data deduplication and garbage collection engine designed to remove duplicate blocks from incremental data backup streams. Sungem features three novel techniques that maximize deduplication throughput without compromising the deduplication ratio. First, Sungem places related fingerprint sequences, rather than fingerprints from the same backup stream, into the same container to improve fingerprint prefetching efficiency. Second, to make the most of the memory reserved for fingerprints, Sungem varies the sampling rate of each fingerprint sequence according to its stability. Third, Sungem combines reference counts and expiration times in a unique way to arrive at the first known incremental garbage collection algorithm whose bookkeeping overhead is proportional to the size of a disk volume's incremental backup snapshot rather than its full backup snapshot. We evaluated the Sungem prototype on a real-world data backup trace and showed that it sustains more than 200,000 fingerprint lookups per second on average on a standard x86 server, including the cost of garbage collection.
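To make the third technique concrete, the following is a minimal Python sketch of how reference counts and expiration times might be combined so that each backup only touches the chunks named in its incremental snapshot. The abstract does not spell out Sungem's actual algorithm; all names here (ChunkRecord, IncrementalGC, on_backup) are hypothetical, and retention-driven expiration is an assumption.

```python
import time

class ChunkRecord:
    """Hypothetical per-chunk bookkeeping record; the two fields mirror the
    two signals the abstract mentions: a reference count and an expiration
    time."""
    def __init__(self):
        self.refcount = 0      # number of live backups referencing this chunk
        self.expires_at = 0.0  # latest retention deadline among those backups

class IncrementalGC:
    """Sketch of GC bookkeeping whose per-backup work is proportional to the
    incremental snapshot: only chunks added or dropped by this backup are
    touched, never the full set of stored chunks."""

    def __init__(self):
        self.index = {}  # fingerprint -> ChunkRecord

    def on_backup(self, added_fps, removed_fps, retention_secs):
        deadline = time.time() + retention_secs
        for fp in added_fps:        # chunks newly referenced by this backup
            rec = self.index.setdefault(fp, ChunkRecord())
            rec.refcount += 1
            rec.expires_at = max(rec.expires_at, deadline)
        for fp in removed_fps:      # chunks a superseded backup no longer needs
            rec = self.index.get(fp)
            if rec is not None:
                rec.refcount -= 1

    def reclaimable(self):
        now = time.time()
        # A chunk is safe to reclaim once nothing references it, or once every
        # backup that could reference it has passed its retention deadline.
        return [fp for fp, rec in self.index.items()
                if rec.refcount <= 0 or rec.expires_at < now]
```

The point of the sketch is the cost model: on_backup iterates only over added_fps and removed_fps, the deltas carried by an incremental backup, which is what makes the bookkeeping overhead scale with the incremental snapshot rather than the full backup snapshot.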