Venti: A New Approach to Archival Storage
FAST '02 Proceedings of the Conference on File and Storage Technologies
Farsite: federated, available, and reliable storage for an incompletely trusted environment
ACM SIGOPS Operating Systems Review - OSDI '02: Proceedings of the 5th symposium on Operating systems design and implementation
Finding similar files in large document repositories
Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Redundancy elimination within large collections of files
ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
Avoiding the disk bottleneck in the data domain deduplication file system
FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
Sparse indexing: large scale, inline deduplication using sampling and locality
FAST '09 Proccedings of the 7th conference on File and storage technologies
HYDRAstor: a Scalable Secondary Storage
FAST '09 Proccedings of the 7th conference on File and storage technologies
The effectiveness of deduplication on virtual machine disk images
SYSTOR '09 Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference
A Novel Optimization Method to Improve De-duplication Storage System Performance
ICPADS '09 Proceedings of the 2009 15th International Conference on Parallel and Distributed Systems
HydraFS: a high-throughput file system for the HYDRAstor content-addressable storage system
FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
Bimodal content defined chunking for backup streams
FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
MAD2: A scalable high-throughput exact deduplication approach for network backup services
MSST '10 Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST)
File routing middleware for cloud deduplication
Proceedings of the 2nd International Workshop on Cloud Computing Platforms
Characteristics of backup workloads in production systems
FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
WAN optimized replication of backup datasets using stream-informed delta compression
FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
Shredder: GPU-accelerated incremental storage and computation
FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
iDedup: latency-aware, inline data deduplication for primary storage
FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
ISOBAR hybrid compression-I/O interleaving for large-scale parallel I/O optimization
Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing
Delta compressed and deduplicated storage using stream-informed locality
HotStorage'12 Proceedings of the 4th USENIX conference on Hot Topics in Storage and File Systems
WAN-optimized replication of backup datasets using stream-informed delta compression
ACM Transactions on Storage (TOS)
Probabilistic deduplication for cluster-based storage systems
Proceedings of the Third ACM Symposium on Cloud Computing
Horizon extender: long-term preservation of data leakage evidence in web traffic
Proceedings of the 8th ACM SIGSAC symposium on Information, computer and communications security
Block locality caching for data deduplication
Proceedings of the 6th International Systems and Storage Conference
A scalable deduplication and garbage collection engine for incremental backup
Proceedings of the 6th International Systems and Storage Conference
Proceedings of the 8th International Conference on Network and Service Management
RevDedup: a reverse deduplication storage system optimized for reads to latest backups
Proceedings of the 4th Asia-Pacific Workshop on Systems
SAFE: A Source Deduplication Framework for Efficient Cloud Backup Services
Journal of Signal Processing Systems
Read-Performance Optimization for Deduplication-Based Storage Systems in the Cloud
ACM Transactions on Storage (TOS)
Low-cost data deduplication for virtual machine backup in cloud storage
HotStorage'13 Proceedings of the 5th USENIX conference on Hot Topics in Storage and File Systems
Memory efficient sanitization of a deduplicated storage system
FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
Concurrent deletion in a distributed content-addressable storage system with global deduplication
FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
Hi-index | 0.00 |
Modern deduplication has become quite effective at eliminating duplicates in data, thus multiplying the effective capacity of disk-based backup systems, and enabling them as realistic tape replacements. Despite these improvements, single-node raw capacity is still mostly limited to tens or a few hundreds of terabytes, forcing users to resort to complex and costly multi-node systems, which usually only allow them to scale to singledigit petabytes. As the opportunities for deduplication efficiency optimizations become scarce, we are challenged with the task of designing deduplication systems that will effectively address the capacity, throughput, management and energy requirements of the petascale age. In this paper we present our high-performance deduplication prototype, designed from the ground up to optimize overall single-node performance, by making the best possible use of a node's resources, and achieve three important goals: scale to large capacity, provide good deduplication efficiency, and near-raw-disk throughput. Instead of trying to improve duplicate detection algorithms, we focus on system design aspects and introduce novelmechanisms--thatwe combinewith careful implementations of known system engineering techniques. In particular, we improve single-node scalability by introducing progressive sampled indexing and grouped mark-and-sweep, and also optimize throughput by utilizing an event-driven, multi-threaded client-server interaction model. Our prototype implementation is able to scale to billions of stored objects, with high throughput, and very little or no degradation of deduplication efficiency.