Avoiding the disk bottleneck in the Data Domain deduplication file system

Authors:
Benjamin Zhu; Kai Li; Hugo Patterson
Affiliations:
Data Domain, Inc.; Data Domain, Inc. and Princeton University; Data Domain, Inc.
Venue:
FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
Year:
2008

Abstract

Disk-based deduplication storage has emerged as the new-generation storage system for enterprise data protection to replace tape libraries. Deduplication removes redundant data segments to compress data into a highly compact form and makes it economical to store backups on disk instead of tape. A crucial requirement for enterprise data protection is high throughput, typically over 100 MB/sec, which enables backups to complete quickly. A significant challenge is to identify and eliminate duplicate data segments at this rate on a low-cost system that cannot afford enough RAM to store an index of the stored segments and may be forced to access an on-disk index for every input segment. This paper describes three techniques employed in the production Data Domain deduplication file system to relieve the disk bottleneck. These techniques are: (1) the Summary Vector, a compact in-memory data structure for identifying new segments; (2) Stream-Informed Segment Layout, a data layout method to improve on-disk locality for sequentially accessed segments; and (3) Locality Preserved Caching, which maintains the locality of the fingerprints of duplicate segments to achieve high cache hit ratios. Together, they can remove 99% of the disk accesses for deduplication of real-world workloads. These techniques enable a modern two-socket dual-core system to run at 90% CPU utilization with only one shelf of 15 disks and achieve 100 MB/sec for single-stream throughput and 210 MB/sec for multi-stream throughput.
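
How the three techniques might cooperate on the write path can be illustrated with a toy model. The sketch below is not the Data Domain implementation. It assumes the Summary Vector behaves like a Bloom-filter-style bit vector, that segments are fingerprinted with SHA-1, that the layout groups segments into fixed-size containers in arrival order, and that the cache loads and evicts fingerprints one whole container at a time. All class names, method names, and parameters (`SummaryVector`, `DedupStore`, `container_size`, `cache_containers`) are hypothetical, and a Python dict stands in for the on-disk index.

```python
import hashlib
from collections import OrderedDict


class SummaryVector:
    """Assumed Bloom-filter-style bit vector: a 'no' is exact, a 'yes' may be wrong."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, fingerprint):
        # Derive several bit positions by hashing the fingerprint with a salt byte.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(bytes([i]) + fingerprint).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, fingerprint):
        for pos in self._positions(fingerprint):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, fingerprint):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(fingerprint))


class DedupStore:
    """Toy write path; only fingerprints are tracked, not segment data."""

    def __init__(self, container_size=4, cache_containers=2):
        self.container_size = container_size
        self.cache_containers = cache_containers
        self.summary = SummaryVector()
        self.disk_index = {}        # fingerprint -> container id (stand-in for the on-disk index)
        self.containers = []        # sealed containers, each a list of fingerprints in stream order
        self.open_container = []    # container currently being filled
        self.cache = OrderedDict()  # container id -> set of fingerprints, in LRU order
        self.index_lookups = 0      # counts simulated on-disk index accesses

    def _store_new(self, fp):
        self.summary.add(fp)
        self.open_container.append(fp)
        if len(self.open_container) == self.container_size:
            # Seal the container so neighbours in the stream stay together.
            cid = len(self.containers)
            self.containers.append(self.open_container)
            for stored in self.open_container:
                self.disk_index[stored] = cid
            self.open_container = []

    def write(self, segment):
        """Return True if the segment is a duplicate, False if it was stored as new."""
        fp = hashlib.sha1(segment).digest()
        if fp in self.open_container:
            return True
        hit = next((cid for cid, fps in self.cache.items() if fp in fps), None)
        if hit is not None:
            self.cache.move_to_end(hit)   # refresh LRU position
            return True
        if not self.summary.might_contain(fp):
            self._store_new(fp)           # definitely new: no index access needed
            return False
        self.index_lookups += 1           # only now consult the "on-disk" index
        cid = self.disk_index.get(fp)
        if cid is None:                   # summary vector false positive
            self._store_new(fp)
            return False
        # Load every fingerprint of the matching container, not just the one asked for.
        self.cache[cid] = set(self.containers[cid])
        if len(self.cache) > self.cache_containers:
            self.cache.popitem(last=False)  # evict the least recently used container
        return True
```

In this model, writing eight distinct segments with `container_size=4` triggers no index lookups, because the summary vector reports each one as new. Replaying the same eight segments in the same order triggers one index lookup per container, after which the remaining segments of that container hit the cache. The ratios in a real system depend on the workload and are reported in the paper, not by this sketch.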