The rapid growth of data volume causes problems such as storage limitations and rising data management costs. Distributed File Systems (DFS) are widely used to store and manage massive data, and data deduplication schemes, which increase the available storage capacity by eliminating duplicate data, are being studied extensively to reduce storage consumption. However, the deduplication process itself incurs performance overhead, such as additional disk I/O. In this paper, we propose a content-based chunk placement scheme that increases the deduplication rate on a DFS. To avoid the overhead of a centralized deduplication process, we run lessfs on each chunk server, so deduplication is performed in a decentralized fashion at each server. Moreover, we use consistent hashing for chunk allocation and failure recovery. Our experimental results show that the proposed system reduces storage consumption by 60% compared to the same system without consistent hashing.
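To make the placement idea concrete, the sketch below shows one plausible way to realize it; it is our own minimal illustration, not the authors' implementation. The class name ConsistentHashRing, the SHA-1 fingerprints, the 100 virtual nodes per server, and the chunkserver-N names are all assumptions for the example. The point it demonstrates: keying each chunk by a hash of its content means identical chunks always route to the same chunk server, where a local deduplicating store (lessfs in the paper) can discard duplicates, and a server failure remaps only the chunks on the departed server's arcs.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Hash ring that routes content-addressed chunks to chunk servers."""

    VNODES = 100  # virtual nodes per server; illustrative, not from the paper

    def __init__(self, servers):
        self._ring = []  # sorted list of (ring point, server name) pairs
        for server in servers:
            self.add_server(server)

    @staticmethod
    def _point(key):
        # Map an arbitrary string key onto the ring via SHA-1 (our assumption).
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    def add_server(self, server):
        for i in range(self.VNODES):
            bisect.insort(self._ring, (self._point(f"{server}#{i}"), server))

    def remove_server(self, server):
        # On failure, only the departed server's arcs move to their ring
        # successors; all other chunk-to-server assignments stay put.
        self._ring = [(p, s) for p, s in self._ring if s != server]

    def server_for_chunk(self, chunk):
        # Content-based placement: the routing key is the chunk fingerprint,
        # so identical chunks always land on the same server, where the local
        # deduplicating store can eliminate the duplicate copy.
        fingerprint = hashlib.sha1(chunk).hexdigest()
        idx = bisect.bisect(self._ring, (self._point(fingerprint), ""))
        return self._ring[idx % len(self._ring)][1]


ring = ConsistentHashRing(["chunkserver-1", "chunkserver-2", "chunkserver-3"])
chunk = b"example chunk contents"
print(ring.server_for_chunk(chunk))   # deterministic: same chunk, same server
ring.remove_server("chunkserver-2")   # simulated failure
print(ring.server_for_chunk(chunk))   # remapped only if it lived on server 2
```

In a scheme of this shape, the virtual nodes keep the chunk load roughly balanced across servers and bound how much data a single server failure forces the system to remap.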