Data deduplication, usually implemented with content-defined chunking (CDC), is today one of the key features of advanced storage systems that provide storage for backup applications. Although simple and effective, CDC generates chunks with sizes clustered around an expected chunk size, which is globally fixed for a given storage system and applies to all backups. This creates an opportunity for improvement, as the optimal chunk size for deduplication varies not only among backup datasets, but also within one dataset: long stretches of unchanged data favor larger chunks, whereas regions of change prefer smaller ones. In this work, we present a new algorithm that deduplicates with big chunks as well as with their subchunks, using a deduplication context containing subchunk-to-container-chunk mappings. When writing data, this context is constructed on the fly with so-called anchor sequences, defined as short sequences of adjacent chunks in a backup stream (the stream of data produced by a backup application, containing the backed-up files). For each anchor sequence, we generate an anchor: a special block holding a set of mappings that covers a contiguous region of the backup stream located ahead of this anchor sequence. If anchor sequences have not changed between backups, the mappings created during the previous backup are prefetched and added to the deduplication context. The context is of limited size and fits in main memory, unlike solutions that require keeping all subchunk mappings for the entire backup stream. At the same time, the context provides most of the mappings needed for subchunk deduplication. Compared to CDC, the new algorithm improves the deduplication ratio by up to 25%, while achieving an almost 3 times larger average block size, as verified by simulations driven by real backup traces.
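
To make the context mechanism concrete, below is a minimal illustrative sketch in Python. Everything in it is an assumption made for illustration, not the paper's actual design: the identifiers (Anchor, DedupContext, build_anchor, write_chunk), the truncated SHA-1 fingerprints, the capacity bound, and the LRU eviction policy. It shows the two ideas the abstract describes: anchors persisting subchunk-to-container-chunk mappings for a region of the stream, and a bounded in-memory context that those mappings are prefetched into when an anchor sequence recurs.

```python
# Sketch of a bounded deduplication context fed by anchors.
# All identifiers and parameters are illustrative assumptions.

import hashlib
from collections import OrderedDict

FP_BYTES = 8                  # assumed: truncated SHA-1 used as a fingerprint
CONTEXT_CAPACITY = 1_000_000  # assumed bound so the context fits in main memory


def fingerprint(data: bytes) -> bytes:
    """Content hash used for whole chunks and for their subchunks."""
    return hashlib.sha1(data).digest()[:FP_BYTES]


class Anchor:
    """A special block persisted with the backup: a set of
    subchunk -> container-chunk mappings covering a contiguous region
    of the backup stream ahead of its anchor sequence."""

    def __init__(self, seq_id: bytes, mappings: dict):
        self.seq_id = seq_id      # identifies the short run of adjacent chunks
        self.mappings = mappings  # subchunk fp -> (container chunk id, offset, length)


def build_anchor(anchor_chunks: list, region_mappings: dict) -> Anchor:
    """Derive the anchor's id from the fingerprints of a short sequence of
    adjacent chunks, so an unchanged run reproduces the same id next backup."""
    seq_id = fingerprint(b"".join(fingerprint(c) for c in anchor_chunks))
    return Anchor(seq_id, region_mappings)


class DedupContext:
    """Bounded in-memory subchunk map. Unlike keeping all subchunk mappings
    for the entire stream, entries are evicted once capacity is exceeded
    (LRU here, purely as an illustrative policy)."""

    def __init__(self, capacity: int = CONTEXT_CAPACITY):
        self.capacity = capacity
        self.map = OrderedDict()

    def prefetch(self, anchor: Anchor) -> None:
        # Called when an anchor sequence recurs in the current backup:
        # pull in the mappings recorded during the previous backup.
        for fp, loc in anchor.mappings.items():
            self.map[fp] = loc
            self.map.move_to_end(fp)
        while len(self.map) > self.capacity:
            self.map.popitem(last=False)  # drop the least recently used entry

    def lookup(self, fp: bytes):
        loc = self.map.get(fp)
        if loc is not None:
            self.map.move_to_end(fp)  # keep hot mappings resident
        return loc


def write_chunk(chunk: bytes, subchunks: list,
                ctx: DedupContext, chunk_index: dict, store: dict) -> list:
    """Deduplicate with the big chunk first; on a miss, fall back to its
    subchunks via the context, storing only subchunks with no known copy."""
    big_fp = fingerprint(chunk)
    if big_fp in chunk_index:            # whole-chunk duplicate
        return [("dup-chunk", big_fp)]
    recipe = []
    for sub in subchunks:
        fp = fingerprint(sub)
        loc = ctx.lookup(fp)
        if loc is not None:              # subchunk duplicate found via the context
            recipe.append(("dup-subchunk", loc))
        else:                            # genuinely new data
            store[fp] = sub
            recipe.append(("new-subchunk", fp))
    chunk_index[big_fp] = recipe         # future backups can dedup the whole chunk
    return recipe
```

In such a scheme, the write path would emit an anchor for each anchor sequence it passes; on the next backup, a recurring seq_id signals that the run of chunks is unchanged, so ctx.prefetch(anchor) loads the previous backup's mappings just before the covered region is written again, which is what keeps the context small yet well stocked with the mappings subchunk deduplication needs.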