Compare-by-hash: a reasoned analysis

Authors:
J. Black
Affiliations:
University of Colorado, Boulder
Venue:
ATEC '06 Proceedings of the annual conference on USENIX '06 Annual Technical Conference
Year:
2006

Citing 0
Cited 11

Consistency-preserving caching of dynamic database content
Proceedings of the 16th international conference on World Wide Web

Improving mobile database access over wide-area networks without degrading consistency
Proceedings of the 5th international conference on Mobile systems, applications and services

Evaluating the usefulness of content addressable storage for high-performance data intensive applications
HPDC '08 Proceedings of the 17th international symposium on High performance distributed computing

Fast, inexpensive content-addressed storage in foundation
ATC'08 USENIX 2008 Annual Technical Conference on Annual Technical Conference

Multi-level comparison of data deduplication in a backup scenario
SYSTOR '09 Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference

Constructing and managing appliances for cloud deployments from repositories of reusable components
HotCloud'09 Proceedings of the 2009 conference on Hot topics in cloud computing

Tolerating file-system mistakes with EnvyFS
USENIX'09 Proceedings of the 2009 conference on USENIX Annual technical conference

Leveraging value locality in optimizing NAND flash-based SSDs
FAST'11 Proceedings of the 9th USENIX conference on File and storage technologies

VMFlock: virtual machine co-migration for the cloud
Proceedings of the 20th international symposium on High performance distributed computing

A study on data deduplication in HPC storage systems
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

A novel approach to data deduplication over the engineering-oriented cloud systems
Integrated Computer-Aided Engineering

Abstract

Compare-by-hash is the now-common practice among systems designers of assuming that when a cryptographic hash function produces the same digest for two files, those files are identical. This approach has been used both in real projects and in research efforts (for example, rsync [16] and LBFS [12]). A recent paper by Henson criticized this practice [8]. The present paper revisits the topic from an advocate's standpoint: we claim that compare-by-hash is completely reasonable, and we offer various arguments in support of this viewpoint, in addition to addressing the concerns raised by Henson.
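
As an illustration of the practice the abstract describes, the following is a minimal sketch, not taken from the paper. It assumes SHA-256 as the cryptographic hash function (the abstract does not name one), and the helper names `digest` and `same_by_hash` are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string (hash choice is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def same_by_hash(a: bytes, b: bytes) -> bool:
    """Compare-by-hash: treat two inputs as identical when their digests are equal.

    The contents themselves are never compared byte by byte.
    """
    return digest(a) == digest(b)


if __name__ == "__main__":
    original = b"contents of a file"
    copy = b"contents of a file"
    changed = b"contents of a file!"
    print(same_by_hash(original, copy))     # equal contents give equal digests
    print(same_by_hash(original, changed))  # a one-byte change alters the digest
```

In a remote-synchronization setting, the appeal is that only the short digests need to be exchanged rather than the files themselves. The cost is a small probability that two different inputs collide on the same digest, which is the risk at the center of the debate the abstract refers to.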