Bridging lossy and lossless compression by motif pattern discovery

Authors:
A. Apostolico;M. Comin;L. Parida
Venue:
General Theory of Information Transfer and Combinatorics
Year:
2006

Abstract

We present data compression techniques hinged on the notion of a motif, interpreted here as a string of intermittently solid and wild characters that recurs more or less frequently in an input sequence or family of sequences. This notion arose originally in the analysis of sequences, particularly biomolecules, because of its multiple implications for the understanding of biological structure and function, and it has been the subject of various characterizations and studies. Correspondingly, motif discovery techniques and tools have been devised. Motif discovery is made hard by the fact that the number of motifs identifiable in a sequence can, in general, be exponential in the size of that sequence. A significant reduction in the number of motifs is achieved through the introduction of irredundant motifs, which in intuitive terms are motifs whose structure and list of occurrences cannot be inferred from a combination of other motifs' occurrences. Although suboptimal, the available procedures for extracting some such motifs are not prohibitively expensive. Here we show that irredundant motifs can be usefully exploited in lossy compression methods that are based on textual substitution and suitable for signals as well as text. Moreover, once the motifs in our lossy encodings are disambiguated into corresponding lossless codebooks, they still prove capable of yielding savings over popular methods in use. Preliminary experiments with these fungible strategies at the crossroads of lossless and lossy data compression show performances that improve over popular methods (e.g., GZip) by more than 20% in lossy and 10% in lossless implementations.
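To make the notion of a motif concrete, the following is a minimal Python sketch of textual substitution with a single motif. It is not the authors' procedure: the motif is supplied by hand rather than discovered as an irredundant motif, and the wild-character symbol `.`, the codeword `#`, and all function names are assumptions introduced only for illustration. The lossy encoder discards the characters at wild positions; the lossless variant disambiguates each occurrence by writing those characters after the codeword.

```python
WILD = "."  # assumed symbol for a wild (don't-care) character


def matches_at(text, motif, i):
    """True if the motif occurs at position i, wild characters matching anything."""
    if i + len(motif) > len(text):
        return False
    return all(m == WILD or m == text[i + k] for k, m in enumerate(motif))


def occurrences(text, motif):
    """List every start position at which the motif occurs in the text."""
    return [i for i in range(len(text) - len(motif) + 1)
            if matches_at(text, motif, i)]


def encode(text, motif, token="#", lossless=False):
    """Greedy left-to-right substitution of non-overlapping motif occurrences.

    Assumes `token` does not occur in `text`. In lossy mode each occurrence
    becomes the bare token; in lossless mode the token is followed by the
    characters found at the motif's wild positions.
    """
    wild_positions = [k for k, m in enumerate(motif) if m == WILD]
    out = []
    i = 0
    while i < len(text):
        if matches_at(text, motif, i):
            out.append(token)
            if lossless:
                out.extend(text[i + k] for k in wild_positions)
            i += len(motif)
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


text = "acgtacctacat"
motif = "ac.t"
print(occurrences(text, motif))             # start positions of the motif
print(encode(text, motif))                  # lossy: wild characters are dropped
print(encode(text, motif, lossless=True))   # lossless: wild characters are kept
```

In this toy input the lossy encoding is shorter than the lossless one because the lossless encoding must carry one extra character per occurrence; actual savings depend on the motif chosen and on how the codebook itself is represented, which this sketch does not model.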