Deduplication has been widely used in disk-based secondary storage systems to improve space efficiency. However, scalable high-throughput deduplication storage faces two challenges. The first is the duplicate-lookup disk bottleneck: the data index is usually too large to fit in the available RAM, which limits deduplication throughput. The second is the storage-node island effect: duplicate data spread across multiple storage nodes is difficult to eliminate. Existing approaches fail to eliminate duplicates completely while addressing both challenges at the same time. This paper proposes MAD2, a scalable high-throughput exact deduplication approach for network backup services. MAD2 eliminates duplicate data at both the file level and the chunk level by employing four techniques to accelerate the deduplication process and distribute data evenly. First, MAD2 organizes fingerprints into a Hash Bucket Matrix (HBM), whose rows preserve the data locality of backups. Second, MAD2 uses a Bloom Filter Array (BFA) as a quick index to identify non-duplicate incoming data objects or to indicate where a possible duplicate may be found. Third, a Dual Cache is integrated into MAD2 to effectively capture and exploit data locality. Finally, MAD2 employs a DHT-based load-balancing technique to distribute data objects evenly among multiple storage nodes in their backup sequences, further enhancing performance under a well-balanced load. We evaluate MAD2 on the backend storage of B-Cloud, a research-oriented distributed system that provides network backup services. Experimental results show that MAD2 significantly outperforms state-of-the-art approximate deduplication approaches in terms of deduplication efficiency, sustaining a deduplication throughput of at least 100 MB/s on each storage component.
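To make the quick-index idea concrete, the sketch below shows how an array of Bloom filters, one per hash-bucket row, can answer "definitely new" or "possibly a duplicate in row r" before any disk lookup. This is a minimal illustration of the general technique, not MAD2's actual implementation: the class names, parameters, and the row-selection rule (fingerprint prefix) are assumptions made for the example, whereas MAD2's HBM rows are organized to preserve backup locality.

```python
import hashlib

class BloomFilter:
    """Simple Bloom filter over a fixed-size bit array (illustrative only)."""
    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, fingerprint: bytes):
        # Derive k bit positions from the chunk fingerprint.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(fingerprint + bytes([i])).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, fingerprint: bytes):
        for p in self._positions(fingerprint):
            self.bits[p // 8] |= 1 << (p % 8)

    def maybe_contains(self, fingerprint: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(fingerprint))


class QuickIndex:
    """Hypothetical Bloom Filter Array: one filter per hash-bucket row.
    A miss in the selected filter proves the fingerprint is new; a hit
    names the only row that could hold the duplicate."""
    def __init__(self, num_rows=256):
        self.filters = [BloomFilter() for _ in range(num_rows)]

    def _row(self, fingerprint: bytes) -> int:
        # Placeholder routing rule: first fingerprint byte picks the row.
        return fingerprint[0] % len(self.filters)

    def insert(self, fingerprint: bytes):
        self.filters[self._row(fingerprint)].add(fingerprint)

    def lookup(self, fingerprint: bytes):
        row = self._row(fingerprint)
        if self.filters[row].maybe_contains(fingerprint):
            return row   # possible duplicate: consult this bucket row on disk
        return None      # definitely new: skip the on-disk index entirely
```

The key property this sketch relies on is that a Bloom filter never gives false negatives: a negative answer lets the system skip the disk lookup safely, while a false positive only costs one extra bucket-row read.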