Decentralized deduplication in SAN cluster file systems

Authors:
Austin T. Clements;Irfan Ahmad;Murali Vilayannur;Jinyuan Li
Affiliations:
MIT CSAIL;VMware, Inc.;VMware, Inc.;VMware, Inc.
Venue:
USENIX'09 Proceedings of the 2009 conference on USENIX Annual technical conference
Year:
2009

Citing 14
Cited 38

Extendible hashing—a fast access method for dynamic files

ACM Transactions on Database Systems (TODS)
GPFS: A Shared-Disk File System for Large Computing Clusters

FAST '02 Proceedings of the Conference on File and Storage Technologies
Venti: A New Approach to Archival Storage

FAST '02 Proceedings of the Conference on File and Storage Technologies
Reclaiming Space from Duplicate Files in a Serverless Distributed File System

ICDCS '02 Proceedings of the 22nd International Conference on Distributed Computing Systems (ICDCS'02)
IBM Storage Tank-- A heterogeneous scalable SAN file system

IBM Systems Journal
Memory resource management in VMware ESX server

OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementation
Providing tunable consistency for a parallel file store

FAST'05 Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies - Volume 4
Single instance storage in Windows® 2000

WSS'00 Proceedings of the 4th conference on USENIX Windows Systems Symposium - Volume 4
Design tradeoffs in applying content addressable storage to enterprise-scale systems based on virtual machines

ATEC '06 Proceedings of the annual conference on USENIX '06 Annual Technical Conference
Avoiding the disk bottleneck in the data domain deduplication file system

FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
Evaluating the usefulness of content addressable storage for high-performance data intensive applications

HPDC '08 Proceedings of the 17th international symposium on High performance distributed computing
Fast, inexpensive content-addressed storage in foundation

ATC'08 USENIX 2008 Annual Technical Conference on Annual Technical Conference
HYDRAstor: a Scalable Secondary Storage

FAST '09 Proceedings of the 7th conference on File and storage technologies
Experiences with content addressable storage and virtual disks

WIOV'08 Proceedings of the First conference on I/O virtualization

Lithium: virtual machine storage for the cloud

Proceedings of the 1st ACM symposium on Cloud computing
I/O Deduplication: Utilizing content similarity to improve I/O performance

ACM Transactions on Storage (TOS)
Storage deduplication for Virtual Ad Hoc Network testbed by File-level Block Sharing

Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
Tracking back references in a write-anywhere file system

FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
I/O deduplication: utilizing content similarity to improve I/O performance

FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
ChunkStash: speeding up inline storage deduplication using flash memory

USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
mClock: handling throughput variability for hypervisor IO scheduling

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Fast and secure laptop backups with encrypted de-duplication

LISA'10 Proceedings of the 24th international conference on Large installation system administration
A study of practical deduplication

FAST'11 Proceedings of the 9th USENIX conference on File and storage technologies
Capo: recapitulating storage for virtual desktops

FAST'11 Proceedings of the 9th USENIX conference on File and storage technologies
Data deduplication system for supporting multi-mode

ACIIDS'11 Proceedings of the Third international conference on Intelligent information and database systems - Volume Part I
SiLo: a similarity-locality based near-exact deduplication scheme with low RAM overhead and high throughput

USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
Energy efficient file transfer mechanism using deduplication scheme

ICHIT'11 Proceedings of the 5th international conference on Convergence and hybrid information technology
How to tell if your cloud files are vulnerable to drive crashes

Proceedings of the 18th ACM conference on Computer and communications security
DeFFS: Duplication-eliminated flash file system

Computers and Electrical Engineering
A study of practical deduplication

ACM Transactions on Storage (TOS)
An empirical analysis of similarity in virtual machine images

Proceedings of the Middleware 2011 Industry Track Workshop
Modeling virtualized applications using machine learning techniques

VEE '12 Proceedings of the 8th ACM SIGPLAN/SIGOPS conference on Virtual Execution Environments
Live deduplication storage of virtual machine images in an open-source cloud

Middleware'11 Proceedings of the 12th ACM/IFIP/USENIX international conference on Middleware
Shredder: GPU-accelerated incremental storage and computation

FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
VM aware journaling: improving journaling file system performance in virtualization environments

Software—Practice & Experience
Demand based hierarchical QoS using storage resource pools

USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
Primary data deduplication-large scale study and system design

USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
A study on data deduplication in HPC storage systems

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Droplet: A Distributed Solution of Data Deduplication

GRID '12 Proceedings of the 2012 ACM/IEEE 13th International Conference on Grid Computing
Live deduplication storage of virtual machine images in an open-source cloud

Proceedings of the 12th International Middleware Conference
Data deduplication using dynamic chunking algorithm

ICCCI'12 Proceedings of the 4th international conference on Computational Collective Intelligence: technologies and applications - Volume Part II
GPFS-SNC: an enterprise storage framework for virtual-machine clouds

IBM Journal of Research and Development
Block locality caching for data deduplication

Proceedings of the 6th International Systems and Storage Conference
A scalable deduplication and garbage collection engine for incremental backup

Proceedings of the 6th International Systems and Storage Conference
RevDedup: a reverse deduplication storage system optimized for reads to latest backups

Proceedings of the 4th Asia-Pacific Workshop on Systems
Read-Performance Optimization for Deduplication-Based Storage Systems in the Cloud

ACM Transactions on Storage (TOS)
DEDIS: distributed exact deduplication for primary storage infrastructures

Proceedings of the 4th annual Symposium on Cloud Computing
Content-based chunk placement scheme for decentralized deduplication on distributed file systems

ICCSA'13 Proceedings of the 13th international conference on Computational Science and Its Applications - Volume 1
Low-cost data deduplication for virtual machine backup in cloud storage

HotStorage'13 Proceedings of the 5th USENIX conference on Hot Topics in Storage and File Systems
Memory efficient sanitization of a deduplicated storage system

FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
Concurrent deletion in a distributed content-addressable storage system with global deduplication

FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
A novel approach to data deduplication over the engineering-oriented cloud systems

Integrated Computer-Aided Engineering

Abstract

File systems hosting virtual machines typically contain many duplicated blocks of data, resulting in wasted storage space and increased storage array cache footprint. Deduplication addresses these problems by storing a single instance of each unique data block and sharing it between all original sources of that data. While deduplication is well understood for file systems with a centralized component, we investigate it in a decentralized cluster file system, specifically in the context of VM storage. We propose DEDE, a block-level deduplication system for live cluster file systems that does not require any central coordination, tolerates host failures, and takes advantage of the block layout policies of an existing cluster file system. In DEDE, hosts keep summaries of their own writes to the cluster file system in shared on-disk logs. Each host periodically and independently processes the summaries of its locked files, merges them with a shared index of blocks, and reclaims any duplicate blocks. DEDE manipulates metadata using general file system interfaces without knowledge of the file system implementation. We present the design, implementation, and evaluation of our techniques in the context of VMware ESX Server. Our results show an 80% reduction in space with minor performance overhead for realistic workloads.
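The workflow described in the abstract (per-host write logs, a periodic merge into a shared index, and reclamation of duplicates) can be illustrated with a toy, in-memory sketch. This is not DEDE's implementation: the class `ToyStore`, its method names, the use of SHA-1 as the content hash, and the dictionary-based "disk" are all assumptions made for illustration. File locking, on-disk log formats, and failure handling are not modeled.

```python
import hashlib


class ToyStore:
    """Hypothetical in-memory stand-in for a shared cluster file system."""

    def __init__(self):
        self.physical = {}  # physical block id -> block contents
        self.mapping = {}   # (file, block number) -> physical block id
        self.index = {}     # content hash -> physical block id (shared index)
        self.logs = {}      # host -> list of (file, block number, content hash)
        self._next_id = 0

    def write(self, host, file, block_no, data):
        # Every write gets a fresh physical block, so a block that other
        # files share is never modified in place.
        pid = self._next_id
        self._next_id += 1
        self.physical[pid] = data
        self.mapping[(file, block_no)] = pid
        # The host records a summary (here, a hash) of its own write.
        digest = hashlib.sha1(data).hexdigest()
        self.logs.setdefault(host, []).append((file, block_no, digest))

    def dedup_pass(self, host):
        """Merge one host's write summaries into the shared index.

        Returns the number of duplicate physical blocks reclaimed.
        """
        reclaimed = 0
        for file, block_no, digest in self.logs.pop(host, []):
            pid = self.mapping.get((file, block_no))
            if pid is None:
                continue
            # Skip stale summaries: the block was overwritten after logging.
            if hashlib.sha1(self.physical[pid]).hexdigest() != digest:
                continue
            canonical = self.index.get(digest)
            if canonical is None or canonical not in self.physical:
                # First time this content is seen: publish it in the index.
                self.index[digest] = pid
            elif canonical != pid and self.physical[canonical] == self.physical[pid]:
                # Contents verified equal: share the canonical block and
                # reclaim the duplicate.
                self.mapping[(file, block_no)] = canonical
                del self.physical[pid]
                reclaimed += 1
        return reclaimed


# Two hosts write identical data to different (hypothetical) VM disk files.
store = ToyStore()
store.write("hostA", "vm1.vmdk", 0, b"x" * 4096)
store.write("hostB", "vm2.vmdk", 0, b"x" * 4096)
store.dedup_pass("hostA")
store.dedup_pass("hostB")
```

Each host runs `dedup_pass` on its own log independently, which mirrors the absence of a central coordinator described above. In this toy the only shared state is the index and the block mapping. Blocks orphaned by overwrites are never freed here, since the sketch keeps no reference counts.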