Bimodal content defined chunking for backup streams

  • Authors: Erik Kruus; Cristian Ungureanu; Cezary Dubnicki

  • Affiliations: NEC Laboratories America; NEC Laboratories America; 9LivesData, LLC

  • Venue: FAST '10: Proceedings of the 8th USENIX Conference on File and Storage Technologies

  • Year: 2010

Abstract

Data deduplication has become a popular technology for reducing the amount of storage space necessary for backup and archival data. Content defined chunking (CDC) techniques are well-established methods of separating a data stream into variable-size chunks such that duplicate content has a good chance of being discovered irrespective of its position in the data stream. Requirements for CDC include fast and scalable operation, as well as achieving good duplicate elimination. While the latter can be achieved by using chunks of small average size, this also increases the amount of metadata necessary to store the relatively more numerous chunks, and negatively impacts the system's performance. We propose a new approach that achieves comparable duplicate elimination while using chunks of larger average size. It involves using two chunk size targets, and mechanisms that dynamically switch between the two based on querying data already stored; we use small chunks in limited regions of transition from duplicate to nonduplicate data, and elsewhere we use large chunks. The algorithms rely on the block store's ability to quickly deliver a high-quality reply to existence queries for already-stored blocks. A chunking decision is made with limited lookahead and a limited number of queries. We present results of running these algorithms on actual backup data, as well as four sets of source code archives. Our algorithms typically achieve similar duplicate elimination to standard algorithms while using chunks 2-4 times as large. Such approaches may be particularly interesting to distributed storage systems that use redundancy techniques (such as error-correcting codes) requiring multiple chunk fragments, for which metadata overheads per stored chunk are high. We find that algorithm variants with more flexibility in location and size of chunks yield better duplicate elimination, at a cost of a higher number of existence queries.
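
The chunking policy described in the abstract can be illustrated with a short, simplified sketch. The code below is not the authors' implementation: the SHA-1 sliding-window hash, the size targets, the minimum chunk sizes, and the `store_has` existence-query callback are illustrative assumptions, and it handles only the duplicate-to-new transition, without the lookahead that the paper's variants use.

```python
# Hypothetical sketch, not the authors' implementation. `store_has` stands in
# for the block store's existence query; the SHA-1 window hash, size targets,
# and minimum chunk sizes are illustrative assumptions.
import hashlib
import os


def cut_points(data, mask_bits, window=48, min_size=None):
    """Content-defined cut points: declare a boundary wherever a hash of the
    trailing `window` bytes has its low `mask_bits` bits all zero, giving an
    expected chunk size of roughly 2**mask_bits bytes."""
    mask = (1 << mask_bits) - 1
    min_size = min_size or (1 << (mask_bits - 2))
    points, last = [], 0
    for i in range(window, len(data)):
        if i - last < min_size:
            continue
        h = int.from_bytes(hashlib.sha1(data[i - window:i]).digest()[:8], "big")
        if h & mask == 0:
            points.append(i)
            last = i
    points.append(len(data))
    return points


def fingerprint(chunk):
    return hashlib.sha1(chunk).hexdigest()


def bimodal_chunks(data, store_has, small_bits=13, large_bits=16):
    """Emit large chunks by default; re-chunk with the small target only in the
    region where the stream transitions from duplicate to new data."""
    out, start, prev_dup = [], 0, True   # treat the stream head as a transition
    for end in cut_points(data, large_bits):
        big = data[start:end]
        if store_has(fingerprint(big)):  # existence query: whole chunk stored?
            out.append(big)
            prev_dup = True
        elif prev_dup:
            # Transition region: small chunks localize the duplicate/new border.
            s = 0
            for e in cut_points(big, small_bits):
                out.append(big[s:e])
                s = e
            prev_dup = False
        else:
            out.append(big)              # deep inside new data: keep it large
            prev_dup = False
        start = end
    return out


# Toy usage: a set of fingerprints plays the role of the block store.
stored = set()
stream = os.urandom(1 << 19)             # ~512 KiB of synthetic data
for c in bimodal_chunks(stream, stored.__contains__):
    stored.add(fingerprint(c))
```

In the real system the existence queries go to the block store itself, and, as the abstract notes, variants with more lookahead and more flexibility in the location and size of chunks trade additional queries for better duplicate elimination.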