Copy detection mechanisms for digital documents
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Space/time trade-offs in hash coding with allowable errors
Communications of the ACM
A low-bandwidth network file system
SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
Venti: A New Approach to Archival Storage
FAST '02 Proceedings of the Conference on File and Storage Technologies
Farsite: federated, available, and reliable storage for an incompletely trusted environment
OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementation
Pastiche: making backup cheap and easy
OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementation
Redundancy elimination within large collections of files
ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
Alternatives for detecting redundancy in storage systems data
ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
TAPER: tiered approach for eliminating redundancy in replica synchronization
FAST'05 Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies - Volume 4
Finding similar files in a large file system
WTEC'94 Proceedings of the USENIX Winter 1994 Technical Conference on USENIX Winter 1994 Technical Conference
PVFS: a parallel file system for linux clusters
ALS'00 Proceedings of the 4th annual Linux Showcase & Conference - Volume 4
Avoiding the disk bottleneck in the data domain deduplication file system
FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
Fast, inexpensive content-addressed storage in foundation
ATC'08 USENIX 2008 Annual Technical Conference on Annual Technical Conference
Sparse indexing: large scale, inline deduplication using sampling and locality
FAST '09 Proceedings of the 7th conference on File and storage technologies
HYDRAstor: a Scalable Secondary Storage
FAST '09 Proceedings of the 7th conference on File and storage technologies
HydraFS: a high-throughput file system for the HYDRAstor content-addressable storage system
FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
Bimodal content defined chunking for backup streams
FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
A study of practical deduplication
FAST'11 Proceedings of the 9th USENIX conference on File and storage technologies
An efficient multi-tier tablet server storage architecture
Proceedings of the 2nd ACM Symposium on Cloud Computing
A study of practical deduplication
ACM Transactions on Storage (TOS)
File routing middleware for cloud deduplication
Proceedings of the 2nd International Workshop on Cloud Computing Platforms
Characteristics of backup workloads in production systems
FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
WAN optimized replication of backup datasets using stream-informed delta compression
FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
Shredder: GPU-accelerated incremental storage and computation
FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
iDedup: latency-aware, inline data deduplication for primary storage
FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
Content-aware load balancing for distributed backup
LISA'11 Proceedings of the 25th international conference on Large Installation System Administration
Generating realistic datasets for deduplication analysis
USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
Reducing impact of data fragmentation caused by in-line deduplication
Proceedings of the 5th Annual International Systems and Storage Conference
WAN-optimized replication of backup datasets using stream-informed delta compression
ACM Transactions on Storage (TOS)
Probabilistic deduplication for cluster-based storage systems
Proceedings of the Third ACM Symposium on Cloud Computing
A scalable inline cluster deduplication framework for big data protection
Proceedings of the 13th International Middleware Conference
Rangoli: space management in deduplication environments
Proceedings of the 6th International Systems and Storage Conference
SAFE: A Source Deduplication Framework for Efficient Cloud Backup Services
Journal of Signal Processing Systems
Read-Performance Optimization for Deduplication-Based Storage Systems in the Cloud
ACM Transactions on Storage (TOS)
Efficiently storing virtual machine backups
HotStorage'13 Proceedings of the 5th USENIX conference on Hot Topics in Storage and File Systems
Concurrent deletion in a distributed content-addressable storage system with global deduplication
FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
As data volumes in data centers grow rapidly, deduplication storage systems face continual challenges in providing the throughput and capacity needed to move backup data within backup and recovery windows. One approach is to build a cluster deduplication storage system from multiple deduplication nodes. The goal is to achieve scalable throughput and capacity using extremely high-throughput (e.g., 1.5 GB/s) nodes, with minimal loss of compression ratio. The key technical issue is routing data intelligently at an appropriate granularity. We present a cluster-based deduplication system that deduplicates at high throughput, supports deduplication ratios comparable to those of a single system, and maintains low variation in storage utilization across individual nodes. In experiments with dozens of nodes, we examine tradeoffs between stateless data routing approaches, which have low overhead, and stateful approaches, which incur higher overhead but avoid imbalances that can adversely affect deduplication effectiveness for some datasets in large clusters. The stateless approach has been deployed in a two-node commercial system that achieves 3 GB/s multi-stream deduplication throughput and currently scales to 5.6 PB of storage (assuming 20X total compression).
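The core idea of stateless routing described above can be sketched briefly: chunks are grouped into coarser units (often called superchunks), and each unit is routed to a node using only a representative fingerprint, with no shared index lookups. The sketch below is a minimal illustration under simplifying assumptions (fixed-size chunking and min-fingerprint selection; the function names and parameters are hypothetical, not the paper's actual implementation, which uses content-defined chunking and other routing features):

```python
import hashlib

def chunk_fingerprints(data: bytes, chunk_size: int = 4096):
    # Fixed-size chunking for illustration only; production systems
    # use content-defined chunking so boundaries survive insertions.
    for i in range(0, len(data), chunk_size):
        yield hashlib.sha1(data[i:i + chunk_size]).digest()

def route_superchunk(fingerprints, num_nodes: int) -> int:
    # Stateless routing: pick a representative fingerprint (here, the
    # minimum) and hash it to a node. Similar superchunks tend to share
    # the same minimum fingerprint, so their duplicate chunks land on
    # the same node without consulting any global state.
    rep = min(fingerprints)
    return int.from_bytes(rep[:8], "big") % num_nodes
```

Because the decision depends only on the superchunk's own content, any node can route independently; the tradeoff, as the abstract notes, is that a stateful approach can correct the storage imbalances such content-driven placement may produce on skewed datasets.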