Many collections of data contain items that are inherently similar. For example, archives contain files with incremental changes between releases. Long-range inter-file similarities are not exploited by standard approaches to compression. We investigate compression that uses similarity from all parts of a collection, which we call collection-based compression (CBC). Input files are delta-encoded by reference to long string matches in a source collection. The expected space requirement of our encoding algorithm is sublinear in the collection size, and its time complexity is linear in the input file size. We show that our scheme achieves better compression for large input files than existing differential compression systems, and that it scales better. We also achieve significant improvement over compressing each file individually with standard utilities: our scheme attains several times the compression of gzip or 7-zip. The overall result is a dramatic improvement on the compression available with existing approaches.
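The core idea of delta-encoding against a source collection can be illustrated with a minimal sketch. This is not the paper's implementation: the function names are hypothetical, the index here stores every fixed-length seed substring in a plain dictionary (a real CBC system would use sampled fingerprints, e.g. Karp-Rabin style rolling hashes, to keep the index sublinear in the collection size), and the greedy match extension is the simplest possible variant.

```python
# Illustrative sketch of collection-based delta encoding (hypothetical
# helper names, not the paper's actual code). The collection is indexed
# by SEED-byte substrings; the encoder emits COPY commands referencing
# matches in the collection and LIT commands for novel bytes.

SEED = 4  # seed length for the match index (illustrative choice)

def build_index(collection: bytes) -> dict:
    """Map every SEED-byte substring to its first offset in the collection."""
    index = {}
    for i in range(len(collection) - SEED + 1):
        index.setdefault(collection[i:i + SEED], i)
    return index

def encode(data: bytes, collection: bytes, index: dict) -> list:
    """Greedy delta encoding: extend each seed hit to a maximal match."""
    ops, i, lit = [], 0, bytearray()
    while i < len(data):
        src = index.get(data[i:i + SEED])
        if src is None:
            lit.append(data[i])
            i += 1
            continue
        # Extend the seed match as far as collection and input agree.
        n = 0
        while (src + n < len(collection) and i + n < len(data)
               and collection[src + n] == data[i + n]):
            n += 1
        if lit:
            ops.append(("LIT", bytes(lit)))
            lit = bytearray()
        ops.append(("COPY", src, n))
        i += n
    if lit:
        ops.append(("LIT", bytes(lit)))
    return ops

def decode(ops: list, collection: bytes) -> bytes:
    """Reconstruct the input from the delta and the source collection."""
    out = bytearray()
    for op in ops:
        if op[0] == "COPY":
            _, src, n = op
            out += collection[src:src + n]
        else:
            out += op[1]
    return bytes(out)
```

As a usage example, encoding `b"the quick red fox"` against the collection `b"the quick brown fox jumps over the lazy dog"` yields a long COPY for the shared prefix, a short literal for `red`, and another COPY for ` fox`; decoding with the same collection reproduces the input exactly.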