We present data compression techniques hinged on the notion of a motif, interpreted here as a string of intermittently solid and wild characters that recurs more or less frequently in an input sequence or family of sequences. This notion arose originally in the analysis of sequences, particularly biomolecules, because of its multiple implications for the understanding of biological structure and function, and it has been the subject of various characterizations and studies. Correspondingly, motif discovery techniques and tools have been devised. The task is made hard by the fact that the number of motifs identifiable in a sequence can, in general, be exponential in the size of that sequence. A significant reduction in the number of motifs is achieved through the introduction of irredundant motifs, which in intuitive terms are motifs whose structure and list of occurrences cannot be inferred from a combination of other motifs' occurrences. Although suboptimal, the available procedures for extracting some such motifs are not prohibitively expensive. Here we show that irredundant motifs can be usefully exploited in lossy compression methods based on textual substitution and suitable for signals as well as text. Moreover, once the motifs in our lossy encodings are disambiguated into corresponding lossless codebooks, they still prove capable of yielding savings over popular methods in use. Preliminary experiments with these fungible strategies at the crossroads of lossless and lossy data compression show performance that improves over popular methods (e.g., GZip) by more than 20% in lossy and 10% in lossless implementations.
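To make the notion of a motif with solid and wild characters concrete, the following is a minimal illustrative sketch, not the procedure used in the work summarized above. It assumes `.` as the wild (don't-care) symbol, and the function names `occurrences`, `disambiguate`, and `lossy_fill` are hypothetical. The first lists where a motif occurs, the second collects the distinct solid strings that a motif matches (one plausible reading of disambiguating a motif into a lossless codebook), and the third resolves each wild position to its most frequent symbol (one plausible lossy choice).

```python
from collections import Counter

WILD = "."  # assumed don't-care symbol; not specified in the source


def occurrences(motif, text):
    """Return every start index where the motif matches the text.

    A solid character must match exactly; a wild character matches anything.
    """
    m = len(motif)
    return [
        i
        for i in range(len(text) - m + 1)
        if all(c == WILD or c == text[i + j] for j, c in enumerate(motif))
    ]


def disambiguate(motif, text):
    """Count the distinct solid strings that the motif matches in the text."""
    m = len(motif)
    return Counter(text[i:i + m] for i in occurrences(motif, text))


def lossy_fill(motif, text):
    """Replace each wild character by the most frequent symbol seen there.

    If the motif never occurs, wild characters are left unresolved.
    """
    occ = occurrences(motif, text)
    filled = []
    for j, c in enumerate(motif):
        if c != WILD or not occ:
            filled.append(c)
        else:
            counts = Counter(text[i + j] for i in occ)
            filled.append(counts.most_common(1)[0][0])
    return "".join(filled)


text = "abcabdabc"
print(occurrences("ab.", text))         # [0, 3, 6]
print(dict(disambiguate("ab.", text)))  # {'abc': 2, 'abd': 1}
print(lossy_fill("ab.", text))          # abc
```

In this toy input, a lossy encoder could replace all three occurrences with a single codebook entry `abc` (distorting the one `abd`), while a lossless encoder would keep both `abc` and `abd` as separate entries.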