iDedup: latency-aware, inline data deduplication for primary storage

Authors:
Kiran Srinivasan;Tim Bisson;Garth Goodson;Kaladhar Voruganti
Affiliations:
NetApp, Inc.;NetApp, Inc.;NetApp, Inc.;NetApp, Inc.
Venue:
FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
Year:
2012

Abstract

Deduplication technologies are increasingly being deployed to reduce cost and increase space efficiency in corporate data centers. However, prior research has not applied deduplication techniques inline to the request path for latency-sensitive, primary workloads, primarily because of the extra latency these techniques introduce. Deduplicating data on disk inherently causes fragmentation, which increases seeks for subsequent sequential reads of the same data and thus increases latency. In addition, deduplicating data requires extra disk IOs to access on-disk deduplication metadata. In this paper, we propose iDedup, an inline deduplication solution for primary workloads that minimizes extra IOs and seeks. Our algorithm is based on two key insights from real-world workloads: i) spatial locality exists in duplicated primary data; and ii) temporal locality exists in the access patterns of duplicated data. Using the first insight, we selectively deduplicate only sequences of disk blocks. This reduces fragmentation and amortizes the seeks caused by deduplication. The second insight allows us to replace the expensive, on-disk deduplication metadata with a smaller, in-memory cache. These techniques enable us to trade off capacity savings for performance, as demonstrated in our evaluation with real-world workloads. Our evaluation shows that iDedup achieves 60-70% of the maximum deduplication with less than a 5% CPU overhead and a 2-4% latency impact.
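
The two techniques named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not give the mechanics, so the following are all assumptions made for illustration: the names FingerprintCache and dedup_write, the use of SHA-256 as the block fingerprint, the LRU eviction policy, and the `threshold` parameter (the minimum length of a run of on-disk-consecutive duplicates that is worth deduplicating).

```python
import hashlib
from collections import OrderedDict


class FingerprintCache:
    """Hypothetical in-memory LRU map: block fingerprint -> disk block number.

    Stands in for the "smaller, in-memory cache" that replaces on-disk
    deduplication metadata; the LRU policy is an assumption.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, fp):
        if fp not in self.entries:
            return None
        self.entries.move_to_end(fp)  # mark as most recently used
        return self.entries[fp]

    def put(self, fp, block_no):
        self.entries[fp] = block_no
        self.entries.move_to_end(fp)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used


def dedup_write(blocks, cache, next_free, threshold):
    """Place the blocks of one write request.

    Returns (placements, next_free), where placements[i] is
    (disk_block_no, is_duplicate). A cache hit is honored only when it is
    part of a run of at least `threshold` blocks whose existing copies are
    consecutive on disk; shorter runs are written as new data.
    """
    fps = [hashlib.sha256(b).digest() for b in blocks]
    hits = [cache.get(fp) for fp in fps]
    placements = [None] * len(blocks)
    i = 0
    while i < len(blocks):
        # Find the end of the run of hits starting at i that is also
        # sequential on disk.
        j = i
        if hits[i] is not None:
            j = i + 1
            while (j < len(blocks) and hits[j] is not None
                   and hits[j] == hits[j - 1] + 1):
                j += 1
        if j - i >= threshold:
            # Run is long enough: share the existing blocks.
            for k in range(i, j):
                placements[k] = (hits[k], True)
            i = j
        else:
            # Run too short (or a miss): write block i to a fresh location.
            placements[i] = (next_free, False)
            cache.put(fps[i], next_free)
            next_free += 1
            i += 1
    return placements, next_free
```

In this sketch, an isolated duplicate block is deliberately written again rather than shared, which gives up some capacity savings to keep the new data contiguous on disk. Raising `threshold` shifts that balance further toward sequential reads, and shrinking the cache capacity reduces memory use at the cost of missed duplicates.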