Fast file existence checking in archiving systems

  • Authors:
  • Saso Tomazic; Vesna Pavlovic; Jasna Milovanovic; Jaka Sodnik; Anton Kos; Sara Stancin; Veljko Milutinovic

  • Affiliations:
  • University of Ljubljana, Ljubljana, Slovenia; University of Belgrade, Beograd, Serbia; University of Belgrade, Beograd, Serbia; University of Ljubljana, Ljubljana, Slovenia; University of Ljubljana, Ljubljana, Slovenia; University of Ljubljana, Ljubljana, Slovenia; IPSI Belgrade

  • Venue:
  • ACM Transactions on Storage (TOS)
  • Year:
  • 2011

Abstract

This article presents a new Fast Hash-based File Existence Checking (FHFEC) method for archiving systems. During archiving, many submitted files are unchanged and do not need to be re-archived. Instead of comparing entire files, FHFEC compares only their digests. Strong cryptographic hash functions with a low probability of collision can be used to compute the digests. We propose a fast algorithm to check whether a given hash, and thus the corresponding file, is already stored in the system. The algorithm divides the whole domain of hashes into equally sized regions and maintains a pointer array with exactly one pointer per region. Each pointer points to the location of the first stored hash from the corresponding region, or is null if no hash from that region is stored. The entire structure can be kept in random access memory or, alternatively, on a dedicated hard disk. A statistical performance analysis shows that in certain cases FHFEC performs nearly optimally, and extensive simulations confirm the analytical results. FHFEC has been compared with binary search (BIS) and the B+tree, which are commonly used in file systems and databases for table indices. The results show that FHFEC significantly outperforms both.
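
The region and pointer-array idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several choices are assumptions made only for illustration: SHA-256 as the hash function, a sorted in-memory list as the hash store, a region chosen by the leading bits of the hash, and the hypothetical names `digest`, `RegionIndex`, `region_bits`, and `contains`.

```python
import hashlib
from typing import Iterable, List, Optional

HASH_BITS = 256  # SHA-256 digest length (assumption: the abstract names no hash function)


def digest(data: bytes) -> int:
    """Return the SHA-256 digest of data as an integer."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


class RegionIndex:
    """Hypothetical sketch of a pointer array over equally sized hash regions."""

    def __init__(self, hashes: Iterable[int], region_bits: int = 8) -> None:
        # The leading region_bits bits of a hash select one of 2**region_bits
        # equally sized regions of the hash domain.
        self.shift = HASH_BITS - region_bits
        # Assumption: stored hashes are kept in one sorted list.
        self.hashes: List[int] = sorted(set(hashes))
        # One pointer per region; None plays the role of the null pointer.
        self.pointers: List[Optional[int]] = [None] * (1 << region_bits)
        # Walk backwards so each pointer ends at the first hash of its region.
        for pos in range(len(self.hashes) - 1, -1, -1):
            self.pointers[self.hashes[pos] >> self.shift] = pos

    def contains(self, h: int) -> bool:
        """Check whether hash h is stored, scanning only its own region."""
        region = h >> self.shift
        pos = self.pointers[region]
        if pos is None:
            return False  # no stored hash falls into this region
        while pos < len(self.hashes) and self.hashes[pos] >> self.shift == region:
            if self.hashes[pos] == h:
                return True
            if self.hashes[pos] > h:
                return False  # list is sorted, so h cannot appear later
            pos += 1
        return False


archived = [b"report.txt v1", b"photo.raw", b"notes.md"]
index = RegionIndex(digest(d) for d in archived)
already_stored = index.contains(digest(b"photo.raw"))
is_new = not index.contains(digest(b"report.txt v2"))
```

In this sketch a lookup costs one array access plus a scan of a single region. Because cryptographic hashes are close to uniformly distributed, choosing roughly as many regions as stored hashes should keep that scan short on average; how the original method handles insertions and on-disk layout is not specified in the abstract and is not modeled here.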