Fast file existence checking in archiving systems

Authors:
Saso Tomazic;Vesna Pavlovic;Jasna Milovanovic;Jaka Sodnik;Anton Kos;Sara Stancin;Veljko Milutinovic
Affiliations:
University of Ljubljana, Ljubljana, Slovenia;University of Belgrade, Beograd, Serbia;University of Belgrade, Beograd, Serbia;University of Ljubljana, Ljubljana, Slovenia;University of Ljubljana, Ljubljana, Slovenia;University of Ljubljana, Ljubljana, Slovenia;IPSI Belgrade
Venue:
ACM Transactions on Storage (TOS)
Year:
2011

Abstract

This article presents a new Fast Hash-based File Existence Checking (FHFEC) method for archiving systems. During the archiving process, many submissions are unchanged files that do not need to be re-archived. Instead of comparing entire files, the method compares only their digests, which can be computed with strong cryptographic hash functions that have a low probability of collision. We propose a fast algorithm to check whether a given hash, and hence the corresponding file, is already stored in the system. The algorithm divides the whole domain of hashes into equally sized regions and maintains a pointer array with exactly one pointer per region. Each pointer points to the location of the first stored hash from the corresponding region, or has a null value if no hash from that region exists. The entire structure can be stored in random access memory or, alternatively, on a dedicated hard disk. A statistical performance analysis shows that in certain cases FHFEC performs nearly optimally, and extensive simulations have confirmed these analytical results. The performance of FHFEC has been compared to that of a binary search (BIS) and a B+tree, which are commonly used in file systems and databases for table indices. The results show that FHFEC significantly outperforms both of them.
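The region-and-pointer idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The choice of SHA-256, the number of regions (REGION_BITS), the class name RegionIndex, and the use of a sorted in-memory list built once from a fixed set of hashes are all assumptions made for illustration; the abstract does not specify how hashes are stored or inserted.

```python
import hashlib

HASH_BITS = 256    # width of a SHA-256 digest (assumed hash function)
REGION_BITS = 16   # hypothetical choice: 2**16 equally sized regions


def digest(data: bytes) -> int:
    """Return the SHA-256 digest of data as a non-negative integer."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def region_of(h: int) -> int:
    """Map a hash to its region: the top REGION_BITS bits of the hash."""
    return h >> (HASH_BITS - REGION_BITS)


class RegionIndex:
    """Illustrative static index: sorted hashes plus one pointer per region."""

    def __init__(self, hashes):
        # Sorting keeps all hashes of one region contiguous (an assumption
        # of this sketch, not a detail taken from the abstract).
        self.table = sorted(set(hashes))
        # One pointer per region; None plays the role of the null pointer.
        self.pointers = [None] * (1 << REGION_BITS)
        for pos, h in enumerate(self.table):
            r = region_of(h)
            if self.pointers[r] is None:
                self.pointers[r] = pos  # first stored hash of region r

    def contains(self, h: int) -> bool:
        r = region_of(h)
        pos = self.pointers[r]
        if pos is None:
            return False  # no hash from this region is stored
        # Scan only the hashes that belong to region r.
        while pos < len(self.table) and region_of(self.table[pos]) == r:
            if self.table[pos] == h:
                return True
            if self.table[pos] > h:
                return False  # sorted order: h cannot appear later
            pos += 1
        return False


if __name__ == "__main__":
    archive = RegionIndex(digest(b"file-%d" % i) for i in range(1000))
    print(archive.contains(digest(b"file-42")))    # an archived file
    print(archive.contains(digest(b"new-file")))   # a file not yet archived
```

In this sketch a lookup costs one array access plus a short scan within a single region, so with many more regions than stored hashes most lookups touch at most a handful of entries. The trade-off between pointer-array size and scan length is governed by REGION_BITS, which is a free parameter here.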