Space/time trade-offs in hash coding with allowable errors
Communications of the ACM
A low-bandwidth network file system
SOSP '01 Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles
Group-Based Management of Distributed File Caches
ICDCS '02 Proceedings of the 22nd International Conference on Distributed Computing Systems (ICDCS'02)
File access prediction with adjustable accuracy
PCC '02 Proceedings of the 21st IEEE International Performance, Computing, and Communications Conference, 2002
Jumbo store: providing efficient incremental upload and versioning for a utility rendering service
FAST '07 Proceedings of the 5th USENIX Conference on File and Storage Technologies
Avoiding the disk bottleneck in the data domain deduplication file system
FAST '08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
Fast, inexpensive content-addressed storage in Foundation
ATC '08 Proceedings of the 2008 USENIX Annual Technical Conference
Sparse indexing: large scale, inline deduplication using sampling and locality
FAST '09 Proceedings of the 7th USENIX Conference on File and Storage Technologies
Multi-level comparison of data deduplication in a backup scenario
SYSTOR '09 Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference
I/O Deduplication: Utilizing content similarity to improve I/O performance
ACM Transactions on Storage (TOS)
HydraFS: a high-throughput file system for the HYDRAstor content-addressable storage system
FAST '10 Proceedings of the 8th USENIX Conference on File and Storage Technologies
Decentralized deduplication in SAN cluster file systems
USENIX ATC '09 Proceedings of the 2009 USENIX Annual Technical Conference
ChunkStash: speeding up inline storage deduplication using flash memory
USENIX ATC '10 Proceedings of the 2010 USENIX Annual Technical Conference
Experiences with content addressable storage and virtual disks
WIOV '08 Proceedings of the First Conference on I/O Virtualization
dedupv1: Improving deduplication throughput using solid state drives (SSD)
MSST '10 Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST)
MAD2: A scalable high-throughput exact deduplication approach for network backup services
MSST '10 Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST)
Venti: a new approach to archival storage
FAST '02 Proceedings of the 1st USENIX Conference on File and Storage Technologies
Building a high-performance deduplication system
USENIX ATC '11 Proceedings of the 2011 USENIX Annual Technical Conference
CABdedupe: A Causality-Based Deduplication Performance Booster for Cloud Backup Services
IPDPS '11 Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium
DBLK: Deduplication for primary block storage
MSST '11 Proceedings of the 2011 IEEE 27th Symposium on Mass Storage Systems and Technologies
Characteristics of backup workloads in production systems
FAST '12 Proceedings of the 10th USENIX Conference on File and Storage Technologies
Power consumption in enterprise-scale backup storage systems
FAST '12 Proceedings of the 10th USENIX Conference on File and Storage Technologies
iDedup: latency-aware, inline data deduplication for primary storage
FAST '12 Proceedings of the 10th USENIX Conference on File and Storage Technologies
Reducing impact of data fragmentation caused by in-line deduplication
SYSTOR '12 Proceedings of the 5th Annual International Systems and Storage Conference
WAN-optimized replication of backup datasets using stream-informed delta compression
ACM Transactions on Storage (TOS)
A study on data deduplication in HPC storage systems
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Data deduplication systems discover and remove redundancy between data blocks by splitting the data stream into chunks and comparing a hash of each chunk against all previously stored hashes. Storing the corresponding chunk index on hard disks immediately limits the achievable throughput, because these devices cannot sustain the high number of random IOs that the index lookups induce. Several approaches have been proposed to overcome this chunk lookup disk bottleneck. Typically, they capture the locality information of one backup run and use it in the next run to predict future chunk requests. However, this locality is often captured only through a surrogate, e.g., the order of the chunks in containers [37]. Furthermore, some approaches degrade slowly as the system operates over months and years, because the locality information becomes outdated. We propose a novel approach, called Block Locality Cache (BLC), that captures the locality of the previous backup run significantly better than existing approaches and always uses up-to-date locality information, making it less prone to aging. We evaluate the approach using a trace-based simulation of multiple real-world backup datasets. The simulation compares the Block Locality Cache with the approach of Zhu et al. [37] and provides a detailed analysis of the behavior and IO patterns. Furthermore, a prototype implementation is used to validate the simulation.
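To make the chunk lookup disk bottleneck and the idea of locality-based prefetching concrete, the following is a minimal Python sketch, not the paper's BLC design: the fixed-size chunking, the DedupStore class, and the parameters CHUNK_SIZE, CONTAINER_CHUNKS, and CACHE_CONTAINERS are illustrative assumptions. The prefetch policy shown mimics the container-order surrogate of Zhu et al. [37] that the abstract contrasts with BLC.

import hashlib
from collections import OrderedDict

CHUNK_SIZE = 8 * 1024    # assumed fixed-size chunking; real systems use content-defined chunking
CONTAINER_CHUNKS = 4     # assumed (tiny) container size so the example stays readable
CACHE_CONTAINERS = 64    # assumed number of prefetched containers kept in RAM

def chunks(data):
    # Split the stream into fixed-size chunks (a simplification).
    for i in range(0, len(data), CHUNK_SIZE):
        yield data[i:i + CHUNK_SIZE]

class DedupStore:
    def __init__(self):
        self.disk_index = {}        # fingerprint -> container id; resides on disk in practice
        self.containers = [[]]      # fingerprints grouped in write order
        self.cache = OrderedDict()  # container id -> fingerprint set, kept in LRU order
        self.disk_lookups = 0       # each one models a random disk IO

    def _lookup(self, fp):
        # 1. Check the prefetched containers first (cheap, in RAM).
        for cid, fps in self.cache.items():
            if fp in fps:
                self.cache.move_to_end(cid)
                return True
        # 2. Fall back to the full index; on disk this is a random read.
        self.disk_lookups += 1
        cid = self.disk_index.get(fp)
        if cid is None:
            return False
        # 3. Locality prefetch: load the whole hit container into RAM,
        #    betting that the next backup repeats roughly the same order.
        self.cache[cid] = set(self.containers[cid])
        if len(self.cache) > CACHE_CONTAINERS:
            self.cache.popitem(last=False)  # evict the least recently used container
        return True

    def write(self, data):
        stored = deduped = 0
        for chunk in chunks(data):
            fp = hashlib.sha1(chunk).digest()
            if self._lookup(fp):
                deduped += 1
            else:
                if len(self.containers[-1]) >= CONTAINER_CHUNKS:
                    self.containers.append([])  # seal the full container, open a new one
                self.containers[-1].append(fp)
                self.disk_index[fp] = len(self.containers) - 1
                stored += 1
        return stored, deduped

# Two identical "backup runs": the first stores 8 distinct chunks, the
# second deduplicates all of them while touching the index only rarely,
# because each index hit prefetches a full container of fingerprints.
data = b"".join(bytes([i]) * CHUNK_SIZE for i in range(8))
store = DedupStore()
print(store.write(data))   # (8, 0): everything stored
print(store.write(data))   # (0, 8): everything deduplicated
print(store.disk_lookups)  # 10: 8 misses in run one, only 2 index reads in run two

In the abstract's terms, this cache captures locality only via a surrogate, the container layout; the paper's point is that BLC instead derives locality from the previous backup run itself, keeping the information up to date across runs.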