We initiate a new class of string matching problems called Substring Compression Problems. Given a string S that may be preprocessed, the problem is either to quickly find the compressed representation or the compressed size of any query substring of S (Substring Compression Query, or SCQ), or to find the length-l substring of S that is least compressible (Least Compressible Substring, or LCS, problem).

Starting from the seminal paper of Lempel and Ziv over 25 years ago, many different methods have emerged for compressing entire strings. Determining substring compressibility is a natural variant that is combinatorially and algorithmically challenging, yet surprisingly it has not been studied before. In addition, compressibility of strings is emerging as a tool to compare biological sequences and analyze their information content. Typically, however, the compressibility of an entire sequence is less informative than that of portions of the sequence. Thus substring compressibility may be a more suitable basis for sequence analysis.

We present the first known, nearly optimal algorithms for substring compression problems (SCQ, LCS and their generalizations) that are exact or provably approximate. Our exact algorithms exploit the structure in strings via suffix trees, and our approximate algorithms rely on new relationships we find between Lempel-Ziv compression and string parsings.
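To make the two problem statements concrete, here is a minimal brute-force sketch in Python. It is not the suffix-tree or parsing-based method the abstract refers to; it simply recompresses each substring from scratch, with no preprocessing. The function names (`lz77_phrases`, `scq_size`, `least_compressible`) are hypothetical, the compressed size is assumed to be the number of phrases in a greedy LZ77-style parse (one of several Lempel-Ziv variants), and "least compressible" is assumed to mean the window with the most phrases.

```python
def lz77_phrases(s: str) -> list[str]:
    """Greedy LZ77-style parse (assumed variant): each phrase is the longest
    prefix of the remaining text that also starts at an earlier position
    (overlap allowed), or a single literal character if there is none."""
    phrases = []
    i, n = 0, len(s)
    while i < n:
        best = 0
        for j in range(i):  # try every earlier starting position
            k = 0
            while i + k < n and s[j + k] == s[i + k]:
                k += 1
            best = max(best, k)
        step = best if best > 0 else 1
        phrases.append(s[i:i + step])
        i += step
    return phrases


def scq_size(s: str, i: int, j: int) -> int:
    """Naive Substring Compression Query: phrase count of s[i:j] (half-open),
    compressed on its own, without reference to the rest of s."""
    return len(lz77_phrases(s[i:j]))


def least_compressible(s: str, l: int) -> tuple[int, int]:
    """Naive Least Compressible Substring: return (start, phrase_count) of the
    length-l window with the most phrases; ties go to the earliest start."""
    best_start, best_count = 0, -1
    for start in range(len(s) - l + 1):
        count = scq_size(s, start, start + l)
        if count > best_count:
            best_start, best_count = start, count
    return best_start, best_count


print(lz77_phrases("abababab"))            # ['a', 'b', 'ababab']
print(scq_size("aaaaabcde", 0, 4))         # 2
print(least_compressible("aaaaabcde", 4))  # (3, 4)
```

Each query here costs time quadratic in the substring length, and the window scan multiplies that by the number of windows, which is the kind of cost that preprocessing S is meant to avoid.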