Anchor-driven subchunk deduplication

  • Authors:
  • Bartłomiej Romański, Łukasz Heldt, Wojciech Kilian, Krzysztof Lichota, Cezary Dubnicki

  • Affiliation:
  • 9LivesData, LLC, Warsaw, Poland (all authors)

  • Venue:
  • Proceedings of the 4th Annual International Conference on Systems and Storage
  • Year:
  • 2011

Abstract

Data deduplication, usually implemented with content-defined chunking (CDC), is today one of the key features of advanced storage systems providing space for backup applications. Although simple and effective, CDC generates chunks with sizes clustered around an expected chunk size, which is globally fixed for a given storage system and applies to all backups. This creates an opportunity for improvement, as the optimal chunk size for deduplication varies not only among backup datasets, but also within one dataset: long stretches of unchanged data favor larger chunks, whereas regions of change prefer smaller ones. In this work, we present a new algorithm which deduplicates with big chunks as well as with their subchunks, using a deduplication context containing subchunk-to-container-chunk mappings. When writing data, this context is constructed on the fly using so-called anchor sequences, defined as short sequences of adjacent chunks in a backup stream (a stream of data produced by a backup application containing the backed-up files). For each anchor sequence, we generate an anchor -- a special block with a set of mappings covering a contiguous region of the backup stream positioned ahead of this anchor sequence. If anchor sequences have not changed between backups, the mappings created during the previous backup are prefetched and added to the deduplication context. The context is of limited size and fits in main memory, unlike solutions which require keeping all subchunk mappings for the entire backup stream. At the same time, the context provides most of the mappings needed for subchunk deduplication. Compared to CDC, the new algorithm improves the deduplication ratio by up to 25%, achieved with an almost 3 times larger average block size, as verified by simulations driven by real backup traces.
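
To make the mechanism concrete, below is a minimal Python sketch of the idea as described in the abstract, under heavy simplifications: fixed-size subchunks instead of content-defined ones, an anchor region of a few chunks, and illustrative parameter values (ANCHOR_SEQ_LEN, REGION_LEN, SUBCHUNK_SIZE, and CONTEXT_CAPACITY are our assumptions, not the paper's). This is not the authors' implementation; it only illustrates how prefetching mappings at unchanged anchor sequences lets subchunks of a modified chunk deduplicate.

```python
import hashlib
from collections import OrderedDict

# Illustrative parameters only -- the abstract does not specify values.
ANCHOR_SEQ_LEN = 2       # adjacent chunks forming one anchor sequence
REGION_LEN = 4           # chunks ahead of the sequence covered by one anchor
SUBCHUNK_SIZE = 1024     # fixed split; the paper derives subchunks with CDC
CONTEXT_CAPACITY = 4096  # bounded so the context fits in main memory


def digest(data):
    return hashlib.sha1(data).hexdigest()


def subchunks(chunk):
    """Yield (hash, offset) for each subchunk of a container chunk."""
    for off in range(0, len(chunk), SUBCHUNK_SIZE):
        yield digest(chunk[off:off + SUBCHUNK_SIZE]), off


class DedupContext:
    """Bounded in-memory map: subchunk hash -> (container chunk hash, offset)."""

    def __init__(self, capacity=CONTEXT_CAPACITY):
        self.capacity = capacity
        self.map = OrderedDict()

    def add(self, mappings):
        for key, loc in mappings:
            self.map[key] = loc
            self.map.move_to_end(key)
            if len(self.map) > self.capacity:
                self.map.popitem(last=False)  # evict the oldest mapping

    def lookup(self, key):
        return self.map.get(key)


def write_backup(chunks, anchors, context, store):
    """Write one backup stream (a list of container chunks).

    `anchors` and `store` persist across backups: `anchors` maps an
    anchor-sequence hash to the subchunk mappings of the region ahead of
    that sequence, `store` holds hashes of container chunks already written.
    Returns how many subchunks deduplicated through the context."""
    new_anchors = {}
    dedup_hits = 0

    for i, chunk in enumerate(chunks):
        if i >= ANCHOR_SEQ_LEN - 1:
            # Hash the last ANCHOR_SEQ_LEN chunk hashes into an anchor sequence.
            window = chunks[i - ANCHOR_SEQ_LEN + 1:i + 1]
            seq = digest(b"".join(digest(c).encode() for c in window))
            if seq in anchors:
                # Unchanged sequence: prefetch the previous backup's
                # mappings for the region ahead into the context.
                context.add(anchors[seq])
            # Record this backup's anchor for the contiguous region of
            # REGION_LEN chunks positioned ahead of the sequence.
            region = chunks[i + 1:i + 1 + REGION_LEN]
            new_anchors[seq] = [(sh, (digest(c), off))
                                for c in region for sh, off in subchunks(c)]

        # Whole-chunk dedup first, then subchunk dedup against the context.
        ch = digest(chunk)
        if ch in store:
            continue
        for sh, _off in subchunks(chunk):
            if context.lookup(sh) is not None:
                dedup_hits += 1  # stored as a reference instead of raw data
        store.add(ch)

    anchors.update(new_anchors)
    return dedup_hits


if __name__ == "__main__":
    anchors, context, store = {}, DedupContext(), set()
    backup1 = [bytes([i]) * 8192 for i in range(16)]
    write_backup(backup1, anchors, context, store)

    backup2 = list(backup1)
    backup2[10] = backup1[10][:4096] + b"\xff" * 4096  # half of one chunk changed
    # The unchanged half of the modified chunk dedups via mappings
    # prefetched at the (unchanged) anchor sequence just before it.
    print(write_backup(backup2, anchors, context, store))  # -> 4
```

Under these assumptions the bounded OrderedDict plays the role of the limited-size context from the abstract: only mappings for regions the stream is about to revisit are ever resident in memory, yet the unchanged half of the modified chunk still deduplicates at subchunk granularity.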