Collection-based compression using discovered long matching strings

Authors:
Andrew Peel;Anthony Wirth;Justin Zobel
Affiliations:
The University of Melbourne, Melbourne, Australia;The University of Melbourne, Melbourne, Australia;The University of Melbourne, Melbourne, Australia
Venue:
Proceedings of the 20th ACM international conference on Information and knowledge management
Year:
2011

Abstract

Many collections of data contain items that are inherently similar. For example, archives contain files with incremental changes between releases. Standard approaches to compression do not exploit long-range inter-file similarities. We investigate collection-based compression (CBC), which uses similarity from all parts of a collection. Input files are delta-encoded by reference to long string matches in a source collection. The expected space requirement of our encoding algorithm is sublinear in the collection size, and the compression time complexity is linear in the input file size. We show that our scheme achieves better compression for large input files than existing differential compression systems, and that it scales better. We also achieve a significant improvement over compressing each file individually with standard utilities: our scheme achieves several times the compression of gzip or 7-zip. The overall result is a dramatic improvement on the compression available with existing approaches.
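
To make the idea of delta-encoding against a source collection concrete, the following is a minimal sketch of one simple way to do it. It is not the paper's algorithm: it assumes a plain dictionary index over non-overlapping fixed-size blocks of the collection, and the names `BLOCK`, `build_index`, `delta_encode`, and `delta_decode` are hypothetical, chosen only for illustration. The input is scanned once; wherever a block of the input matches an indexed block, the match is extended forwards and emitted as a copy instruction, and everything else is emitted as literal bytes.

```python
from typing import Dict, List, Tuple, Union

BLOCK = 8  # hypothetical block length in bytes, chosen only for illustration

# A delta is a list of ("copy", source_offset, length) or ("lit", raw_bytes).
Op = Union[Tuple[str, int, int], Tuple[str, bytes]]


def build_index(collection: bytes, block: int = BLOCK) -> Dict[bytes, int]:
    """Map each non-overlapping block of the collection to its first offset."""
    index: Dict[bytes, int] = {}
    for off in range(0, len(collection) - block + 1, block):
        index.setdefault(collection[off:off + block], off)
    return index


def delta_encode(data: bytes, collection: bytes,
                 index: Dict[bytes, int], block: int = BLOCK) -> List[Op]:
    """Encode data as copies from the collection plus literal runs."""
    ops: List[Op] = []
    literal = bytearray()
    i = 0
    while i < len(data):
        src = index.get(data[i:i + block]) if i + block <= len(data) else None
        if src is None:
            # No indexed block starts here: keep the byte as a literal.
            literal.append(data[i])
            i += 1
            continue
        # Extend the block match forwards as far as both sequences agree.
        length = block
        while (i + length < len(data) and src + length < len(collection)
               and data[i + length] == collection[src + length]):
            length += 1
        if literal:
            ops.append(("lit", bytes(literal)))
            literal = bytearray()
        ops.append(("copy", src, length))
        i += length
    if literal:
        ops.append(("lit", bytes(literal)))
    return ops


def delta_decode(ops: List[Op], collection: bytes) -> bytes:
    """Rebuild the original data from the delta and the same collection."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, src, length = op
            out += collection[src:src + length]
        else:
            out += op[1]
    return bytes(out)
```

Because only every `BLOCK`-th position of the collection is indexed, the index holds roughly one entry per block rather than one per byte, at the cost of possibly missing matches shorter than about twice the block length. A real system would also need a compact on-disk encoding of the instructions, which this sketch omits.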