MAD2: A scalable high-throughput exact deduplication approach for network backup services

  • Authors:
  • Jiansheng Wei; Hong Jiang; Ke Zhou; Dan Feng

  • Affiliations:
  • School of Computer, Huazhong University of Science and Technology, Wuhan, China, Wuhan National Laboratory for Optoelectronics, Wuhan, China;Dept. of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE, USA;School of Computer, Huazhong University of Science and Technology, Wuhan, China, Wuhan National Laboratory for Optoelectronics, Wuhan, China;School of Computer, Huazhong University of Science and Technology, Wuhan, China, Wuhan National Laboratory for Optoelectronics, Wuhan, China

  • Venue:
  • MSST '10 Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST)
  • Year:
  • 2010

Abstract

Deduplication has been widely used in disk-based secondary storage systems to improve space efficiency. However, scalable high-throughput deduplication storage faces two challenges. The first is the duplicate-lookup disk bottleneck: the data index usually exceeds the available RAM, which limits deduplication throughput. The second is the storage node island effect: duplicate data spread across multiple storage nodes is difficult to eliminate. Existing approaches fail to completely eliminate duplicates while simultaneously addressing both challenges. This paper proposes MAD2, a scalable high-throughput exact deduplication approach for network backup services. MAD2 eliminates duplicate data at both the file level and the chunk level, employing four techniques to accelerate the deduplication process and evenly distribute data. First, MAD2 organizes fingerprints into a Hash Bucket Matrix (HBM), whose rows can be used to preserve the data locality in backups. Second, MAD2 uses a Bloom Filter Array (BFA) as a quick index to identify non-duplicate incoming data objects or to indicate where a possible duplicate can be found. Third, a Dual Cache is integrated in MAD2 to effectively capture and exploit data locality. Finally, MAD2 employs a DHT-based Load-Balance technique to evenly distribute data objects among multiple storage nodes in their backup sequences, further enhancing performance with a well-balanced load. We evaluate MAD2 on the backend storage of B-Cloud, a research-oriented distributed system that provides network backup services. Experimental results show that MAD2 significantly outperforms state-of-the-art approximate deduplication approaches in terms of deduplication efficiency, supporting a deduplication throughput of at least 100 MB/s for each storage component.
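
The role the abstract assigns to the Bloom Filter Array can be illustrated with a minimal Python sketch. It is not the paper's implementation: the class names, the filter size, the number of hash functions, the use of SHA-1 fingerprints, and the idea that each filter covers one partition of the fingerprint index are all assumptions made for illustration. The sketch only shows the general idea that an empty candidate list means the object is certainly new, while a non-empty list names the partitions where a duplicate might be found.

```python
import hashlib


class BloomFilter:
    """A plain Bloom filter over byte-string fingerprints (illustrative sizes)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, fingerprint):
        # Double hashing: derive all bit positions from two 64-bit values
        # taken from one SHA-256 digest of the fingerprint.
        digest = hashlib.sha256(fingerprint).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step odd
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, fingerprint):
        for pos in self._positions(fingerprint):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, fingerprint):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(fingerprint)
        )


class BloomFilterArray:
    """Hypothetical quick index: one Bloom filter per index partition."""

    def __init__(self, num_filters=4):
        self.filters = [BloomFilter() for _ in range(num_filters)]

    def add(self, partition, fingerprint):
        self.filters[partition].add(fingerprint)

    def candidates(self, fingerprint):
        # An empty list proves the object is new; otherwise only the listed
        # partitions need an exact (possibly on-disk) lookup.
        return [
            i for i, f in enumerate(self.filters) if f.might_contain(fingerprint)
        ]


def fingerprint_of(data):
    # Assumption: a SHA-1 digest stands in for the object fingerprint.
    return hashlib.sha1(data).digest()
```

In this sketch, a fingerprint registered with `add(2, fp)` makes `candidates(fp)` include partition 2, and a lookup against a freshly created array returns an empty list. Because Bloom filters admit false positives, a non-empty result is only a hint and must still be confirmed by an exact lookup, which is consistent with the abstract's wording of a "possible duplicate".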