Droplet: A Distributed Solution of Data Deduplication

Authors:
Yang Zhang;Yongwei Wu;Guangwen Yang
Venue:
GRID '12 Proceedings of the 2012 ACM/IEEE 13th International Conference on Grid Computing
Year:
2012

References:
Chord: A scalable peer-to-peer lookup service for internet applications. Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications
A low-bandwidth network file system. SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
Memory resource management in VMware ESX server. ACM SIGOPS Operating Systems Review - OSDI '02: Proceedings of the 5th symposium on Operating systems design and implementation
The Google file system. SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Cuckoo hashing. Journal of Algorithms
Design tradeoffs in applying content addressable storage to enterprise-scale systems based on virtual machines. ATEC '06 Proceedings of the annual conference on USENIX '06 Annual Technical Conference
Dynamo: Amazon's highly available key-value store. Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Avoiding the disk bottleneck in the Data Domain deduplication file system. FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
The effectiveness of deduplication on virtual machine disk images. SYSTOR '09 Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference
The case for RAMClouds: scalable high-performance storage entirely in DRAM. ACM SIGOPS Operating Systems Review
Difference engine: harnessing memory redundancy in virtual machines. Communications of the ACM
Decentralized deduplication in SAN cluster file systems. USENIX'09 Proceedings of the 2009 conference on USENIX Annual technical conference
ChunkStash: speeding up inline storage deduplication using flash memory. USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
Experiences with content addressable storage and virtual disks. WIOV'08 Proceedings of the First conference on I/O virtualization
Finding a needle in Haystack: Facebook's photo storage. OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Venti: a new approach to archival storage. FAST'02 Proceedings of the 1st USENIX conference on File and storage technologies


Abstract

Creating backup copies is the most commonly used technique for protecting against data loss, and backing up routinely is a best practice for increasing reliability. Such backups create multiple redundant data streams that are uneconomical to store directly on disk. Similarly, enterprise archival systems usually deal with redundant data that must be kept for later access. Deduplication is an essential technique in these situations: it avoids storing identical data segments and thus saves a significant portion of disk usage. Recent studies have also shown that deduplication can effectively reduce the disk space used to store virtual machine (VM) disk images. We present Droplet, a distributed deduplication storage system designed for high throughput and scalability. Droplet stripes input data streams across multiple storage nodes, which limits the number of data segments stored on each node and ensures that the fingerprint index fits in memory. The in-memory fingerprint index avoids the disk bottleneck discussed in the Data Domain and ChunkStash work and provides excellent lookup performance. The buffering layer in Droplet provides good write performance for small data segments. Compressing data segments reduces disk usage one step further.
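
The idea of striping segments across nodes, each with its own in-memory fingerprint index, can be illustrated with a minimal sketch. This is not Droplet's implementation: the abstract does not specify the segment size, the fingerprint function, the placement rule, or the compression scheme, so the fixed 4 KB segments, SHA-1 fingerprints, modulo placement, and zlib compression below are all assumptions, and the names StorageNode, Cluster, write_stream, and read_stream are hypothetical. The buffering layer is omitted.

```python
import hashlib
import zlib

SEGMENT_SIZE = 4096  # assumed fixed segment size; the abstract does not give one


class StorageNode:
    """Hypothetical storage node with an in-memory fingerprint index."""

    def __init__(self):
        # fingerprint -> compressed segment; the dict value stands in for
        # an on-disk location in a real system.
        self.index = {}

    def put(self, fingerprint, segment):
        """Store the segment unless its fingerprint is already indexed.

        Returns True if the segment was new, False if it was a duplicate.
        """
        if fingerprint in self.index:
            return False
        self.index[fingerprint] = zlib.compress(segment)  # assumed compressor
        return True

    def get(self, fingerprint):
        return zlib.decompress(self.index[fingerprint])


class Cluster:
    """Hypothetical front end that stripes segments across storage nodes."""

    def __init__(self, node_count):
        self.nodes = [StorageNode() for _ in range(node_count)]

    def node_for(self, fingerprint):
        # Assumed placement rule: leading fingerprint bytes modulo node count.
        # Identical segments always map to the same node, so each node only
        # needs to index its own share of the fingerprints.
        slot = int.from_bytes(fingerprint[:8], "big") % len(self.nodes)
        return self.nodes[slot]

    def write_stream(self, data):
        """Split data into segments and store only the unseen ones.

        Returns the recipe (ordered fingerprints) and the number of
        segments that were actually stored.
        """
        recipe = []
        new_segments = 0
        for offset in range(0, len(data), SEGMENT_SIZE):
            segment = data[offset:offset + SEGMENT_SIZE]
            fingerprint = hashlib.sha1(segment).digest()
            if self.node_for(fingerprint).put(fingerprint, segment):
                new_segments += 1
            recipe.append(fingerprint)
        return recipe, new_segments

    def read_stream(self, recipe):
        """Rebuild the original data from its recipe."""
        return b"".join(self.node_for(fp).get(fp) for fp in recipe)
```

Under these assumptions, writing a stream whose first and third segments are identical stores only two segments, and writing the same stream a second time stores none; the recipe returned by write_stream is enough to rebuild the data with read_stream.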