Compression and the Wheel of Fortune

  • Authors:
  • Alberto Apostolico; Laxmi Parida

  • Venue:
  • DCC '03 Proceedings of the Conference on Data Compression
  • Year:
  • 2003

Abstract

We present data compression techniques hinged on the notion of a motif, interpreted here as a string of intermittently solid and wild characters that recurs more or less frequently in an input sequence or family of sequences. This notion arises originally in the analysis of sequences, particularly biomolecules, due to its multiple implications in the understanding of biological structure and function, and it has been the subject of various characterizations and study. Correspondingly, motif discovery techniques and tools have been devised. This task is made hard by the circumstance that the number of motifs identifiable in general in a sequence can be exponential in the size of that sequence. A significant gain in the direction of reducing the number of motifs is achieved through the introduction of irredundant motifs, which in intuitive terms are motifs of which the structure and list of occurrences cannot be inferred by a combination of other motifs' occurrences. Remarkably, the number of irredundant motifs in a sequence is at worst linear in the length of that sequence. Although suboptimal, the available procedures for the extraction of such motifs are not prohibitively expensive. Here we show that irredundant motifs can be usefully exploited in lossy compression methods based on textual substitution and suitable for signals as well as text. Actually, once the motifs in our lossy encodings are disambiguated into corresponding lossless codebooks, they still prove capable of yielding savings over popular methods in use. Preliminary experiments with these fungible strategies at the crossroads of lossless and lossy data compression show performances that improve over popular methods by more than 20% in lossy and 10% in lossless implementations.
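
To make the notion of a motif with solid and wild characters concrete, the following is a minimal sketch, not the authors' algorithm. It assumes that `.` denotes a wild (don't-care) position, and the names `WILD`, `matches_at`, `occurrences`, and `lossy_substitute`, as well as the `#` replacement token, are hypothetical choices made for illustration. The sketch lists the occurrences of one given motif and then performs a naive greedy textual substitution, which is lossy because the characters under wild positions are discarded. It does not attempt motif discovery or the extraction of irredundant motifs.

```python
WILD = "."  # assumed symbol for a wild (don't-care) position


def matches_at(motif: str, text: str, i: int) -> bool:
    """Return True if `motif` occurs in `text` starting at index i."""
    if i < 0 or i + len(motif) > len(text):
        return False
    # A solid character must match exactly; a wild character matches anything.
    return all(m == WILD or m == c for m, c in zip(motif, text[i:i + len(motif)]))


def occurrences(motif: str, text: str) -> list[int]:
    """List every start position at which `motif` occurs in `text`."""
    return [i for i in range(len(text) - len(motif) + 1) if matches_at(motif, text, i)]


def lossy_substitute(motif: str, text: str, token: str = "#") -> str:
    """Greedy left-to-right replacement of non-overlapping occurrences by `token`.

    The result is lossy: whatever stood under a wild position is not recorded.
    """
    out = []
    i = 0
    while i < len(text):
        if motif and matches_at(motif, text, i):
            out.append(token)
            i += len(motif)  # skip the whole occurrence
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    text = "xabcyabdz"
    motif = "ab."
    print(occurrences(motif, text))
    print(lossy_substitute(motif, text))
```

A lossless variant in the spirit of the abstract would additionally record, for each occurrence, the characters that filled the wild positions; that bookkeeping is left out of this sketch.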