Deduplication technologies are increasingly being deployed to reduce cost and increase space efficiency in corporate data centers. However, prior research has not applied deduplication techniques inline to the request path for latency-sensitive primary workloads. This is primarily due to the extra latency these techniques introduce. Inherently, deduplicating data on disk causes fragmentation that increases seeks for subsequent sequential reads of the same data, thus increasing latency. In addition, deduplicating data requires extra disk IOs to access on-disk deduplication metadata. In this paper, we propose iDedup, an inline deduplication solution for primary workloads that minimizes extra IOs and seeks. Our algorithm is based on two key insights from real-world workloads: i) spatial locality exists in duplicated primary data; and ii) temporal locality exists in the access patterns of duplicated data. Using the first insight, we selectively deduplicate only sequences of disk blocks. This reduces fragmentation and amortizes the seeks caused by deduplication. The second insight allows us to replace the expensive, on-disk deduplication metadata with a smaller, in-memory cache. These techniques enable us to trade off capacity savings for performance, as demonstrated in our evaluation with real-world workloads. Our evaluation shows that iDedup achieves 60-70% of the maximum deduplication with less than a 5% CPU overhead and a 2-4% latency impact.
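The two insights can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `FingerprintCache` and `plan_write`, the `threshold` parameter, the use of SHA-256 fingerprints, the LRU eviction policy, and the simple bump allocator for new blocks are all assumptions made for illustration. The sketch deduplicates a run of incoming blocks only when the run is at least `threshold` blocks long and the existing copies sit at consecutive on-disk block numbers, and it keeps fingerprints only in a bounded in-memory cache.

```python
import hashlib
from collections import OrderedDict


class FingerprintCache:
    """Bounded in-memory LRU map: block fingerprint -> on-disk block number."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def lookup(self, fp):
        dbn = self.entries.get(fp)
        if dbn is not None:
            self.entries.move_to_end(fp)  # mark as most recently used
        return dbn

    def insert(self, fp, dbn):
        self.entries[fp] = dbn
        self.entries.move_to_end(fp)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used


def plan_write(blocks, cache, next_free_dbn, threshold):
    """Decide, per block, whether to share an existing block or write a new one.

    Returns (placements, next_free_dbn), where placements is a list of
    (disk_block_number, is_deduplicated) pairs in write order.
    Simplification: duplicates within the same request are not detected.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    fps = [hashlib.sha256(b).digest() for b in blocks]
    hits = [cache.lookup(fp) for fp in fps]
    placements = []
    i = 0
    while i < len(blocks):
        # Find the run starting at i whose duplicates are contiguous on disk.
        j = i
        if hits[i] is not None:
            j = i + 1
            while (j < len(blocks) and hits[j] is not None
                   and hits[j] == hits[j - 1] + 1):
                j += 1
        if j - i >= threshold:
            # Long enough run: share the existing blocks, write nothing.
            placements.extend((hits[k], True) for k in range(i, j))
            i = j
        else:
            # Cache miss or run too short: write the blocks as new data.
            for k in range(i, max(j, i + 1)):
                placements.append((next_free_dbn, False))
                cache.insert(fps[k], next_free_dbn)
                next_free_dbn += 1
            i = max(j, i + 1)
    return placements, next_free_dbn
```

In this sketch, raising `threshold` gives up capacity savings on short duplicate runs in exchange for less fragmentation, and shrinking the cache `capacity` gives up savings on rarely repeated data in exchange for lower memory use. This mirrors, in simplified form, the capacity-versus-performance trade-off described in the abstract.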