Copy detection mechanisms for digital documents
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Recursive hashing functions for n-grams
ACM Transactions on Information Systems (TOIS)
A low-bandwidth network file system
SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
A Digital Signature Based on a Conventional Encryption Function
CRYPTO '87 A Conference on the Theory and Applications of Cryptographic Techniques on Advances in Cryptology
Design and Implementation of a Storage Repository Using Commonality Factoring
MSS '03 Proceedings of the 20 th IEEE/11 th NASA Goddard Conference on Mass Storage Systems and Technologies (MSS'03)
Finding similar files in a large file system
WTEC'94 Proceedings of the USENIX Winter 1994 Technical Conference on USENIX Winter 1994 Technical Conference
Single instance storage in Windows® 2000
WSS'00 Proceedings of the 4th conference on USENIX Windows Systems Symposium - Volume 4
Avoiding the disk bottleneck in the data domain deduplication file system
FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
Sparse indexing: large scale, inline deduplication using sampling and locality
FAST '09 Proccedings of the 7th conference on File and storage technologies
HYDRAstor: a Scalable Secondary Storage
FAST '09 Proccedings of the 7th conference on File and storage technologies
Multi-level comparison of data deduplication in a backup scenario
SYSTOR '09 Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference
dedupv1: Improving deduplication throughput using solid state drives (SSD)
MSST '10 Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST)
Venti: a new approach to archival storage
FAST'02 Proceedings of the 1st USENIX conference on File and storage technologies
USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
iDedup: latency-aware, inline data deduplication for primary storage
FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
Rangoli: space management in deduplication environments
Proceedings of the 6th International Systems and Storage Conference
Hi-index | 0.00 |
Explosion of data growth and duplication of data in enterprises has led to the deployment of a variety of deduplication technologies. However not all deduplication technologies serve the needs of every workload. Most prior research in deduplication concentrates on fixed block size (or variable block size at a fixed block boundary) deduplication which provides sub-optimal space efficiency in workloads where the duplicate data is not block aligned. Workloads also differ in the nature of operations and their priorities thereby affecting the choice of the right flavor of deduplication. Object workloads for instance, hold multiple versions of archived documents that have a high degree of duplicate data. They are also write-once read-many in nature and follow a whole object GET, PUT and DELETE model and would be better served by a deduplication strategy that takes care of nonblock aligned changes to data. In this paper, we describe and evaluate a hybrid of a variable length and block based deduplication that is hierarchical in nature. We are motivated by the following insights from real world data: (a) object workload applications do not do in-place modification of data and hence new versions of objects are written again as a whole (b) significant amount of data among different versions of the same object is shareable but the changes are usually not block aligned. While the second point is the basis for variable length technique, both the above insights motivate our hierarchical deduplication strategy. We show through experiments with production data-sets from enterprise environments that this provides up to twice the space savings compared to a fixed block deduplication.