Delta compression techniques are commonly used to succinctly represent an updated version of a file with respect to an earlier one. In this paper, we study the use of delta compression in a somewhat different scenario, where we wish to compress a large collection of (more or less) related files by performing a sequence of pairwise delta compressions. The problem of finding an optimal delta encoding for a collection of files by taking pairwise deltas can be reduced to the problem of computing a branching of maximum weight in a weighted directed graph, but this solution is inefficient and thus does not scale to larger file collections. This motivates us to propose a framework for cluster-based delta compression that uses text clustering techniques to prune the graph of possible pairwise delta encodings. To demonstrate the efficacy of our approach, we present experimental results on collections of web pages. Our experiments show that cluster-based delta compression of collections provides significant improvements in compression ratio as compared to individually compressing each file or using tar+gzip, at a moderate cost in efficiency.
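
The idea of pruning the graph of pairwise deltas before choosing reference files can be illustrated with a small sketch. This is not the paper's implementation, and it rests on several assumptions: zlib's preset dictionary stands in for a real delta compressor, Jaccard similarity over byte shingles stands in for the text clustering step, and each file may only reference files that appear earlier in the input. That last restriction avoids cycles but does not, in general, yield the maximum-weight branching. The names `compressed_size`, `shingles`, `jaccard`, `plan_deltas`, and the `threshold` parameter are hypothetical.

```python
import zlib


def compressed_size(data: bytes, reference: bytes = b"") -> int:
    """Deflated size of data, optionally primed with a reference file.

    Assumption: a zlib preset dictionary approximates a delta compressor.
    Only the last 32 KiB of the reference fit in the deflate window.
    """
    if reference:
        comp = zlib.compressobj(level=9, zdict=reference[-32768:])
    else:
        comp = zlib.compressobj(level=9)
    return len(comp.compress(data) + comp.flush())


def shingles(data: bytes, k: int = 8) -> set:
    """Set of all k-byte substrings, used as a cheap similarity sketch."""
    return {data[i:i + k] for i in range(max(1, len(data) - k + 1))}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def plan_deltas(files: dict, threshold: float = 0.2) -> dict:
    """Map each file name to a reference file name, or None if standalone.

    Candidate edges whose similarity falls below `threshold` are pruned
    without running the (comparatively expensive) compressor on them.
    """
    names = list(files)
    sketches = {name: shingles(files[name]) for name in names}
    plan = {}
    for i, name in enumerate(names):
        best_ref = None
        best_size = compressed_size(files[name])
        # Only earlier files are candidates, so the result has no cycles.
        for ref in names[:i]:
            if jaccard(sketches[name], sketches[ref]) < threshold:
                continue  # pruned edge: files look unrelated
            size = compressed_size(files[name], files[ref])
            if size < best_size:
                best_ref, best_size = ref, size
        plan[name] = best_ref
    return plan


if __name__ == "__main__":
    base = " ".join(str(i * i) for i in range(300)).encode()
    pages = {
        "a.html": base,
        "b.html": base + b" updated footer",
        "c.bin": bytes((i * 37 + 11) % 251 for i in range(1500)),
    }
    for name, ref in plan_deltas(pages).items():
        print(name, "<-", ref)
```

A full solution would replace the "earlier files only" rule with a maximum-weight branching computed over the pruned graph, where each edge weight is the space saved by encoding one file against another.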