Substring compression problems

Authors:
Graham Cormode;S. Muthukrishnan
Affiliations:
Rutgers University, Piscataway NJ;Rutgers University, Piscataway NJ
Venue:
SODA '05 Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms
Year:
2005



Abstract

We initiate a new class of string matching problems called Substring Compression Problems. Given a string S that may be preprocessed, the problem is to quickly find the compressed representation or the compressed size of any query substring of S (Substring Compression Query or SCQ) or to find the length-l substring of S whose compression is the least (Least Compressible Substring or LCS problem).

Starting from the seminal paper of Lempel and Ziv over 25 years ago, many different methods have emerged for compressing entire strings. Determining substring compressibility is a natural variant that is combinatorially and algorithmically challenging, yet surprisingly has not been studied before. In addition, compressibility of strings is emerging as a tool to compare biological sequences and analyze their information content. However, typically, the compressibility of the entire sequence is not as informative as that of portions of the sequences. Thus substring compressibility may be a more suitable basis for sequence analysis.

We present the first known, nearly optimal algorithms for substring compression problems (SCQ, LCS and their generalizations) that are exact or provably approximate. Our exact algorithms exploit the structure in strings via suffix trees and our approximate algorithms rely on new relationships we find between Lempel-Ziv compression and string parsings.
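
To make the two problem statements concrete, the following is a minimal brute-force sketch in Python. It is not the paper's method: it does no preprocessing and uses no suffix trees. It assumes a simplified greedy Lempel-Ziv parse in which each phrase is either the longest match starting at an earlier position (overlap allowed) or a single new character, and it measures compressed size as the number of phrases. The function names `lz_phrase_count`, `substring_compression_query` and `least_compressible_substring` are hypothetical and chosen only for illustration.

```python
def lz_phrase_count(s: str) -> int:
    """Count phrases in a greedy LZ77-style parse of s (naive, no index).

    Assumption: a phrase is the longest prefix of the remaining text that
    also starts at some earlier position, or one literal character if no
    such match exists.
    """
    n = len(s)
    i = 0
    phrases = 0
    while i < n:
        longest = 0
        # Try every earlier start position j and extend the match greedily.
        for j in range(i):
            k = 0
            while i + k < n and s[j + k] == s[i + k]:
                k += 1
            longest = max(longest, k)
        i += max(longest, 1)  # literal character when nothing matches
        phrases += 1
    return phrases


def substring_compression_query(s: str, i: int, j: int) -> int:
    """SCQ by brute force: compressed size (phrase count) of s[i:j]."""
    return lz_phrase_count(s[i:j])


def least_compressible_substring(s: str, l: int) -> tuple[int, int]:
    """LCS problem by brute force.

    Returns (start, phrase_count) for the length-l substring with the
    largest phrase count; the earliest start wins ties.
    """
    best_start, best_count = -1, -1
    for start in range(len(s) - l + 1):
        count = lz_phrase_count(s[start:start + l])
        if count > best_count:
            best_start, best_count = start, count
    return best_start, best_count


if __name__ == "__main__":
    text = "aaaabcdaaaa"
    print(substring_compression_query(text, 0, 4))
    print(least_compressible_substring(text, 4))
```

Each query here re-parses the substring from scratch, which is exactly the cost that preprocessing S is meant to avoid; the sketch only pins down what quantity an SCQ or LCS answer refers to under the stated assumptions.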