Building a High-Performance Deduplication System

  • Authors:
  • Fanglu Guo; Petros Efstathopoulos

  • Affiliations:
  • Symantec Research Labs, Symantec Corporation, Culver City, CA

  • Venue:
  • USENIX ATC '11: Proceedings of the 2011 USENIX Annual Technical Conference
  • Year:
  • 2011

Abstract

Modern deduplication has become quite effective at eliminating duplicates in data, thus multiplying the effective capacity of disk-based backup systems and establishing them as realistic tape replacements. Despite these improvements, single-node raw capacity is still mostly limited to tens or a few hundreds of terabytes, forcing users to resort to complex and costly multi-node systems, which usually allow them to scale only to single-digit petabytes. As the opportunities for deduplication efficiency optimizations become scarce, we are challenged with the task of designing deduplication systems that will effectively address the capacity, throughput, management, and energy requirements of the petascale age. In this paper we present our high-performance deduplication prototype, designed from the ground up to optimize overall single-node performance by making the best possible use of a node's resources, and to achieve three important goals: scale to large capacity, provide good deduplication efficiency, and deliver near-raw-disk throughput. Instead of trying to improve duplicate detection algorithms, we focus on system design aspects and introduce novel mechanisms that we combine with careful implementations of known system engineering techniques. In particular, we improve single-node scalability by introducing progressive sampled indexing and grouped mark-and-sweep, and we optimize throughput by utilizing an event-driven, multi-threaded client-server interaction model. Our prototype implementation scales to billions of stored objects with high throughput and little or no degradation of deduplication efficiency.
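The abstract only names the paper's two scalability mechanisms, so the sketches below are illustrative readings rather than the authors' implementation. First, a minimal Python sketch of a progressively sampled fingerprint index: only a sample of chunk fingerprints is kept in memory, and the sample is coarsened as the index grows, so memory stays bounded while stored capacity scales. All identifiers (ProgressiveSampledIndex, max_entries, sample_bits) are assumptions made for illustration.

    import hashlib

    class ProgressiveSampledIndex:
        """Hypothetical sketch of progressive sampled indexing."""

        def __init__(self, max_entries=1_000_000):
            self.max_entries = max_entries
            self.sample_bits = 0   # keep 1 in 2**sample_bits fingerprints
            self.index = {}        # fingerprint -> container id

        def _sampled(self, fp):
            # Keep a fingerprint only if its low sample_bits bits are zero.
            mask = (1 << self.sample_bits) - 1
            return int.from_bytes(fp[:8], "big") & mask == 0

        def lookup(self, fp):
            # A miss here is not proof of uniqueness: a real system would
            # also consult per-container fingerprint lists (not shown).
            return self.index.get(fp)

        def insert(self, fp, container_id):
            if self._sampled(fp):
                self.index[fp] = container_id
                if len(self.index) > self.max_entries:
                    self._downsample()

        def _downsample(self):
            # Progressively halve the sampling rate and evict entries
            # that no longer match the coarser predicate.
            self.sample_bits += 1
            self.index = {fp: c for fp, c in self.index.items()
                          if self._sampled(fp)}

    # Example: look up and index the fingerprint of one chunk.
    idx = ProgressiveSampledIndex()
    fp = hashlib.sha1(b"some chunk data").digest()
    if idx.lookup(fp) is None:
        idx.insert(fp, container_id=7)

Because duplicate chunks tend to arrive in long runs, designs like this typically let a sampled hit trigger loading of the matched container's full fingerprint list, so deduplication efficiency degrades little even at low sampling rates; the sketch omits that prefetch path.

Grouped mark-and-sweep can be rendered in the same spirit, assuming the intuitive reading of the name: backups are partitioned into groups that record which data containers they reference, so garbage collection re-marks only groups whose backups changed and sweeps containers referenced by no group. Again, every identifier here is hypothetical.

    class GroupedMarkAndSweep:
        """Hypothetical sketch of grouped mark-and-sweep collection."""

        def __init__(self):
            self.group_refs = {}   # group id -> set of referenced containers

        def mark_group(self, group_id, container_ids):
            # Re-mark a single group when its backups change, instead of
            # rescanning every backup in the system.
            self.group_refs[group_id] = set(container_ids)

        def delete_group(self, group_id):
            self.group_refs.pop(group_id, None)

        def sweep(self, all_containers):
            # A container is reclaimable only if no group references it.
            live = (set().union(*self.group_refs.values())
                    if self.group_refs else set())
            return [c for c in all_containers if c not in live]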