We propose a technique, called hash challenges, for reducing communication overhead when sending data across a network. Hash challenges build on existing deduplication solutions based on compare-by-hash, identifying redundant data chunks while exchanging substantially less meta-data. The technique can be layered directly on any existing compare-by-hash protocol with no significant additional computational cost. Using real data from reference workloads, we show that hash challenges save up to 64% of the meta-data exchanged across the network relative to plain compare-by-hash. This translates into reductions of up to 7% in overall transferred volume and performance gains of up to 16% over typical asymmetric broadband connections.
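To make the trade-off concrete, the following is a minimal sketch of the idea, not the paper's actual protocol: plain compare-by-hash ships one full digest per chunk, whereas a challenge-style variant first ships only a short hash prefix and falls back to the full digest just for chunks whose prefix matches on the receiver's side. All names, the fixed-size chunking, and the 4-byte prefix length are illustrative assumptions.

```python
import hashlib

CHUNK_SIZE = 4096      # assumed fixed-size chunking, for simplicity
PREFIX_BYTES = 4       # hypothetical short "challenge" instead of a full digest

def chunk_hashes(data):
    """Split data into fixed-size chunks and hash each one."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    return chunks, [hashlib.sha256(c).digest() for c in chunks]

def plain_compare_by_hash(hashes, receiver_index):
    """Baseline: send every full digest; receiver reports which it lacks."""
    meta_bytes = sum(len(h) for h in hashes)
    missing = [i for i, h in enumerate(hashes) if h not in receiver_index]
    return missing, meta_bytes

def hash_challenge(hashes, receiver_index):
    """Sketch: send short prefixes first; full digests only on prefix hits."""
    meta_bytes = len(hashes) * PREFIX_BYTES
    prefix_index = {h[:PREFIX_BYTES] for h in receiver_index}
    missing = []
    for i, h in enumerate(hashes):
        if h[:PREFIX_BYTES] in prefix_index:
            meta_bytes += len(h)       # resolve a possible prefix collision
            if h not in receiver_index:
                missing.append(i)
        else:
            missing.append(i)          # prefix unknown: chunk is definitely new
    return missing, meta_bytes
```

Both paths return the same set of missing chunks, but when most chunks are already held by the receiver (the deduplication-friendly case), the challenge variant exchanges far fewer meta-data bytes, since only prefix hits trigger a full-digest exchange.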