MAD2: A scalable high-throughput exact deduplication approach for network backup services

Authors:
Jiansheng Wei; Hong Jiang; Ke Zhou; Dan Feng
Affiliations:
Jiansheng Wei, Ke Zhou, Dan Feng: School of Computer, Huazhong University of Science and Technology, Wuhan, China, and Wuhan National Laboratory for Optoelectronics, Wuhan, China
Hong Jiang: Dept. of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE, USA
Venue:
MSST '10 Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST)
Year:
2010

Citing 0
Cited 11

Anchor-driven subchunk deduplication
Proceedings of the 4th Annual International Conference on Systems and Storage

Building a high-performance deduplication system
USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference

Secure deduplication on mobile devices
Proceedings of the 2011 Workshop on Open Source and Design of Communication

ISOBAR hybrid compression-I/O interleaving for large-scale parallel I/O optimization
Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing

Block locality caching for data deduplication
Proceedings of the 6th International Systems and Storage Conference

Content-based chunk placement scheme for decentralized deduplication on distributed file systems
ICCSA'13 Proceedings of the 13th international conference on Computational Science and Its Applications - Volume 1

Low-cost data deduplication for virtual machine backup in cloud storage
HotStorage'13 Proceedings of the 5th USENIX conference on Hot Topics in Storage and File Systems

Memory efficient sanitization of a deduplicated storage system
FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies

Concurrent deletion in a distributed content-addressable storage system with global deduplication
FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies

File recipe compression in data deduplication systems
FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies

A novel approach to data deduplication over the engineering-oriented cloud systems
Integrated Computer-Aided Engineering

Abstract

Deduplication has been widely used in disk-based secondary storage systems to improve space efficiency. However, scalable high-throughput deduplication storage faces two challenges. The first is the duplicate-lookup disk bottleneck: the data index usually exceeds the available RAM, so lookups go to disk and limit deduplication throughput. The second is the storage node island effect: duplicate data spread across multiple storage nodes is difficult to eliminate. Existing approaches fail to completely eliminate duplicates while simultaneously addressing both challenges. This paper proposes MAD2, a scalable high-throughput exact deduplication approach for network backup services. MAD2 eliminates duplicate data at both the file level and the chunk level, employing four techniques to accelerate the deduplication process and evenly distribute data. First, MAD2 organizes fingerprints into a Hash Bucket Matrix (HBM), whose rows can be used to preserve the data locality in backups. Second, MAD2 uses a Bloom Filter Array (BFA) as a quick index that either identifies incoming data objects as non-duplicates or indicates where to find a possible duplicate. Third, MAD2 integrates a Dual Cache to effectively capture and exploit data locality. Finally, MAD2 employs a DHT-based Load-Balance technique to evenly distribute data objects among multiple storage nodes in their backup sequences, further enhancing performance with a well-balanced load. We evaluate MAD2 on the backend storage of B-Cloud, a research-oriented distributed system that provides network backup services. Experimental results show that MAD2 significantly outperforms state-of-the-art approximate deduplication approaches in terms of deduplication efficiency, supporting a deduplication throughput of at least 100 MB/s for each storage component.
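
The role the abstract assigns to the Bloom Filter Array, ruling out non-duplicates in memory or pointing to where a possible duplicate may live, can be illustrated with a minimal sketch. The abstract does not describe the BFA's internals, so everything here is an assumption made for illustration: one Bloom filter per index partition, SHA-1 fingerprints, double hashing for bit positions, and the names `BloomFilter`, `BloomFilterArray`, `candidates`, and `fingerprint_of`. It is not the authors' implementation.

```python
import hashlib


class BloomFilter:
    """A plain Bloom filter over byte-string fingerprints (illustrative)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        # num_bits is assumed to be a multiple of 8.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, fingerprint):
        # Double hashing: derive k bit positions from two 64-bit halves
        # of a SHA-256 digest of the fingerprint.
        digest = hashlib.sha256(fingerprint).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, fingerprint):
        for pos in self._positions(fingerprint):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, fingerprint):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(fingerprint))


class BloomFilterArray:
    """One Bloom filter per index partition (hypothetical layout)."""

    def __init__(self, num_filters=4):
        self.filters = [BloomFilter() for _ in range(num_filters)]

    def add(self, partition, fingerprint):
        self.filters[partition].add(fingerprint)

    def candidates(self, fingerprint):
        # Empty list: the object is definitely new, so no disk lookup is needed.
        # Otherwise: only the listed partitions may hold a duplicate, and each
        # hit still has to be confirmed against the real index (false positives).
        return [i for i, f in enumerate(self.filters)
                if f.might_contain(fingerprint)]


def fingerprint_of(data):
    """Fingerprint a chunk or file; SHA-1 is an assumption of this sketch."""
    return hashlib.sha1(data).digest()


if __name__ == "__main__":
    bfa = BloomFilterArray(num_filters=4)
    fp = fingerprint_of(b"chunk-A")
    print(bfa.candidates(fp))   # nothing stored yet, so no candidates
    bfa.add(2, fp)
    print(bfa.candidates(fp))   # now includes partition 2
```

Because a Bloom filter has no false negatives, an empty candidate list is a safe "not a duplicate" answer, which is what lets such an index avoid disk lookups for new data. A non-empty list only narrows the search, so exact deduplication still requires verifying the fingerprint in the on-disk index.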