Cluster-Based Delta Compression of a Collection of Files

Authors:
Zan Ouyang;Nasir D. Memon;Torsten Suel;Dimitre Trendafilov
Venue:
WISE '02 Proceedings of the 3rd International Conference on Web Information Systems Engineering
Year:
2002

Citing 0
Cited 13

Locality-sensitive hashing scheme based on p-stable distributions
SCG '04 Proceedings of the twentieth annual symposium on Computational geometry

Server-friendly delta compression for efficient web access
Web content caching and distribution

Automatic Fragment Detection in Dynamic Web Pages and Its Impact on Caching
IEEE Transactions on Knowledge and Data Engineering

Approximate maximum weight branchings
Information Processing Letters

Improving duplicate elimination in storage systems
ACM Transactions on Storage (TOS)

Redundancy elimination within large collections of files
ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference

Content-based document routing and index partitioning for scalable similarity-based searches in a large corpus
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining

On compressing the textual web
Proceedings of the third ACM international conference on Web search and data mining

Scalable techniques for document identifier assignment in inverted indexes
Proceedings of the 19th international conference on World wide web

PRESIDIO: A Framework for Efficient Archival Data Storage
ACM Transactions on Storage (TOS)

COCA filters: co-occurrence aware bloom filters
SPIRE'11 Proceedings of the 18th international conference on String processing and information retrieval

A novel approach for leveraging co-occurrence to improve the false positive error in signature files
Journal of Discrete Algorithms

LSH-based large scale chinese calligraphic character recognition
Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries

Quantified Score

Hi-index	0.01

Abstract

Delta compression techniques are commonly used to succinctly represent an updated version of a file with respect to an earlier one. In this paper, we study the use of delta compression in a somewhat different scenario, where we wish to compress a large collection of (more or less) related files by performing a sequence of pairwise delta compressions. The problem of finding an optimal delta encoding for a collection of files by taking pairwise deltas can be reduced to the problem of computing a branching of maximum weight in a weighted directed graph, but this solution is inefficient and thus does not scale to larger file collections. This motivates us to propose a framework for cluster-based delta compression that uses text clustering techniques to prune the graph of possible pairwise delta encodings. To demonstrate the efficacy of our approach, we present experimental results on collections of web pages. Our experiments show that cluster-based delta compression of collections provides significant improvements in compression ratio as compared to individually compressing each file or using tar+gzip, at a moderate cost in efficiency.
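
The idea of pruning the graph of pairwise deltas can be illustrated with a minimal sketch. It is not the method of the paper, whose details the abstract does not give, and it rests on several assumptions. A Jaccard similarity over byte shingles stands in for the text clustering step. zlib with a preset dictionary stands in for a real delta compressor. Each file may only reference a file that appears earlier in the input, which keeps the result a valid branching (no cycles) but does not compute a maximum-weight one. All function names and the threshold value are hypothetical.

```python
import zlib


def shingles(data: bytes, k: int = 4) -> set:
    """Return the set of k-byte substrings (shingles) of data."""
    return {data[i:i + k] for i in range(max(len(data) - k + 1, 1))}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def solo_size(target: bytes) -> int:
    """Compressed size of target on its own."""
    return len(zlib.compress(target, 9))


def delta_size(reference: bytes, target: bytes) -> int:
    """Stand-in for a delta compressor (an assumption): deflate target
    with reference supplied as a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=reference)
    return len(comp.compress(target) + comp.flush())


def plan_encoding(files: dict, threshold: float = 0.3) -> dict:
    """Map each file name to (reference name or None, encoded size).

    A candidate edge ref -> name is evaluated only if the two files
    look similar; all other edges are pruned without running the
    (comparatively expensive) delta computation.
    """
    names = list(files)
    sh = {n: shingles(files[n]) for n in names}
    plan = {}
    for i, name in enumerate(names):
        best_ref, best_cost = None, solo_size(files[name])
        for ref in names[:i]:  # earlier files only, so no cycles
            if jaccard(sh[name], sh[ref]) < threshold:
                continue  # pruned edge
            cost = delta_size(files[ref], files[name])
            if cost < best_cost:
                best_ref, best_cost = ref, cost
        plan[name] = (best_ref, best_cost)
    return plan
```

A file whose entry has reference `None` is compressed on its own and acts as a root of the branching; every other file is stored as a delta against its chosen reference. Replacing the ordering constraint with a maximum branching computation over the surviving edges would move the sketch closer to the formulation in the abstract.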