Cluster-Based Delta Compression of a Collection of Files

  • Authors:
  • Zan Ouyang; Nasir D. Memon; Torsten Suel; Dimitre Trendafilov

  • Venue:
  • WISE '02 Proceedings of the 3rd International Conference on Web Information Systems Engineering
  • Year:
  • 2002

Quantified Score

Hi-index 0.01

Abstract

Delta compression techniques are commonly used to succinctly represent an updated version of a file with respect to an earlier one. In this paper, we study the use of delta compression in a somewhat different scenario, where we wish to compress a large collection of (more or less) related files by performing a sequence of pairwise delta compressions. The problem of finding an optimal delta encoding for a collection of files by taking pairwise deltas can be reduced to the problem of computing a branching of maximum weight in a weighted directed graph, but this solution is inefficient and thus does not scale to larger file collections. This motivates us to propose a framework for cluster-based delta compression that uses text clustering techniques to prune the graph of possible pairwise delta encodings. To demonstrate the efficacy of our approach, we present experimental results on collections of web pages. Our experiments show that cluster-based delta compression of collections provides significant improvements in compression ratio as compared to individually compressing each file or using tar+gzip, at a moderate cost in efficiency.
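
The reduction mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes zlib primed with a preset dictionary as a stand-in for a real delta compressor (which only helps for reference files within zlib's 32 KB window), and it finds the maximum-weight branching by brute force, which is exponential and only workable for a handful of files. The names compressed_size, has_cycle, and best_branching are hypothetical. Each file either stands alone or is encoded against exactly one other file, and the chosen edges must not form a cycle.

```python
import itertools
import zlib


def compressed_size(data: bytes, reference: bytes = b"") -> int:
    """Size of `data` after DEFLATE, optionally primed with `reference`.

    Assumption: a zlib preset dictionary approximates a delta compressor.
    """
    if reference:
        comp = zlib.compressobj(level=9, zdict=reference)
    else:
        comp = zlib.compressobj(level=9)
    return len(comp.compress(data) + comp.flush())


def has_cycle(parent) -> bool:
    """Return True if following parent links from any node loops forever."""
    for start in range(len(parent)):
        seen = set()
        node = start
        while node is not None:
            if node in seen:
                return True
            seen.add(node)
            node = parent[node]
    return False


def best_branching(files):
    """Brute-force maximum-weight branching over pairwise delta savings.

    Returns (parent, gain): parent[j] is the index of the reference file
    for file j, or None if file j is compressed on its own; gain is the
    total number of bytes saved relative to compressing every file alone.
    """
    n = len(files)
    alone = [compressed_size(f) for f in files]
    # gain[i][j]: bytes saved by encoding file j relative to file i.
    gain = [
        [alone[j] - compressed_size(files[j], files[i]) if i != j else 0
         for j in range(n)]
        for i in range(n)
    ]
    # Every file picks either no parent or one of the other files.
    choices = [[None] + [i for i in range(n) if i != j] for j in range(n)]
    best_parent, best_gain = [None] * n, 0
    for parent in itertools.product(*choices):
        if has_cycle(parent):
            continue  # a cyclic assignment could never be decoded
        total = sum(gain[p][j] for j, p in enumerate(parent) if p is not None)
        if total > best_gain:
            best_parent, best_gain = list(parent), total
    return best_parent, best_gain
```

In this sketch the graph is complete, so all n(n-1) pairwise sizes are computed before any edge is selected. A cluster-based variant would, under the same assumptions, fill in gain[i][j] only for pairs of files that a clustering step places in the same group and treat the remaining edges as absent.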