Collection-based compression using discovered long matching strings

  • Authors:
  • Andrew Peel; Anthony Wirth; Justin Zobel

  • Affiliations:
  • The University of Melbourne, Melbourne, Australia; The University of Melbourne, Melbourne, Australia; The University of Melbourne, Melbourne, Australia

  • Venue:
  • Proceedings of the 20th ACM international conference on Information and knowledge management
  • Year:
  • 2011

Abstract

Many collections of data contain items that are inherently similar. For example, archives contain files with incremental changes between releases. Standard approaches to compression do not exploit such long-range, inter-file similarities. We investigate collection-based compression (CBC), which exploits similarity from all parts of a collection. Input files are delta-encoded by reference to long string matches in a source collection. The expected space requirement of our encoding algorithm is sublinear in the collection size, and the compression time is linear in the input file size. We show that, for large input files, our scheme achieves better compression than existing differential compression systems, and scales better. We also achieve significantly better compression than compressing each file individually with standard utilities: our scheme achieves several times the compression of gzip or 7-zip. The overall result is a dramatic improvement over the compression available with existing approaches.
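
As a rough illustration of what delta-encoding against long string matches in a source collection can look like, the following Python sketch may help. It is not the authors' algorithm: the block size `BLOCK`, the exact-match dictionary used as an index, and the `("copy", ...)` / `("lit", ...)` operation format are all assumptions made for this example. The index samples only block-aligned substrings of the collection, so it holds about one entry per `BLOCK` bytes; that is a constant-factor saving only, not the sublinear space bound stated in the abstract.

```python
from typing import Dict, List, Tuple, Union

BLOCK = 16  # hypothetical sampling interval and minimum match length

# An operation is ("copy", source_offset, length) or ("lit", literal_bytes).
Op = Union[Tuple[str, int, int], Tuple[str, bytes]]


def build_index(collection: bytes, block: int = BLOCK) -> Dict[bytes, int]:
    """Map each block-aligned substring of the collection to its first offset."""
    index: Dict[bytes, int] = {}
    for pos in range(0, len(collection) - block + 1, block):
        index.setdefault(collection[pos:pos + block], pos)
    return index


def encode(data: bytes, collection: bytes, index: Dict[bytes, int],
           block: int = BLOCK) -> List[Op]:
    """Delta-encode data as copies from the collection plus literal runs."""
    ops: List[Op] = []
    literal = bytearray()
    i = 0
    while i < len(data):
        # Look up the window starting at i; too-short tails cannot match.
        src = index.get(data[i:i + block]) if i + block <= len(data) else None
        if src is None:
            literal.append(data[i])
            i += 1
            continue
        # Extend the match forward as far as both strings agree.
        length = block
        while (i + length < len(data)
               and src + length < len(collection)
               and data[i + length] == collection[src + length]):
            length += 1
        if literal:
            ops.append(("lit", bytes(literal)))
            literal.clear()
        ops.append(("copy", src, length))
        i += length
    if literal:
        ops.append(("lit", bytes(literal)))
    return ops


def decode(ops: List[Op], collection: bytes) -> bytes:
    """Rebuild the original data from the operations and the collection."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, src, length = op
            out += collection[src:src + length]
        else:
            out += op[1]
    return bytes(out)
```

Because every input position is checked against the sampled index, any shared string of at least `2 * BLOCK - 1` bytes contains an aligned block and is therefore found. A practical system would presumably use rolling fingerprints rather than slicing a window at each position, extend matches backward as well as forward, and entropy-code the resulting operations.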