Scalable detection of frequent substrings by grammar-based compression

Authors:
Masaya Nakahara;Shirou Maruyama;Tetsuji Kuboyama;Hiroshi Sakamoto
Affiliations:
Kyushu Institute of Technology, Iizuka-shi, Fukuoka;Kyushu University, Fukuoka;Gakushuin University, Toshima, Tokyo;Kyushu Institute of Technology, Fukuoka and PRESTO JST, Saitama, Japan
Venue:
DS'11 Proceedings of the 14th international conference on Discovery science
Year:
2011

References:
Algorithms on strings, trees, and sequences: computer science and computational biology
Approximation algorithms for grammar-based compression. SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Compressed Text Databases with Efficient Query Algorithms Based on the Compressed Suffix Array. ISAAC '00 Proceedings of the 11th International Conference on Algorithms and Computation
Rapid identification of repeated patterns in strings, trees and arrays. STOC '72 Proceedings of the fourth annual ACM symposium on Theory of computing
Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theoretical Computer Science
The string edit distance matching problem with moves. ACM Transactions on Algorithms (TALG)
Edit distance with move operations. Journal of Discrete Algorithms
Engineering a compressed suffix tree implementation. Journal of Experimental Algorithmics (JEA)
An efficient algorithm for finding similar short substrings from large scale string data. PAKDD'08 Proceedings of the 12th Pacific-Asia conference on Advances in knowledge discovery and data mining
Grammar-based codes: a new class of universal lossless source codes. IEEE Transactions on Information Theory
The similarity metric. IEEE Transactions on Information Theory
Clustering by compression. IEEE Transactions on Information Theory
The smallest grammar problem. IEEE Transactions on Information Theory

Cited by:
ESP-index: A compressed index based on edit-sensitive parsing. Journal of Discrete Algorithms

Abstract

We propose a scalable method for pattern discovery based on compression. A string can be represented by a context-free grammar (CFG) that derives the string deterministically. In this framework of grammar-based compression, the aim of the algorithm is to output as small a CFG as possible, and this optimization problem can be solved approximately. Among such approximation algorithms, the compressor by Sakamoto et al. (2009) is especially suitable for detecting maximal common substrings as well as long frequent substrings. This is possible because of the characteristics of edit-sensitive parsing (ESP) by Cormode and Muthukrishnan (2007), which was introduced to approximate a variant of edit distance. Based on ESP, we design a linear-time algorithm that approximately finds all frequent patterns in a string, and we prove a lower bound on the length of the extracted frequent patterns. We also evaluate the performance of our algorithm through experiments on DNA sequences and other compressible real-world texts. Compared to the practical algorithm developed by Uno (2008), our algorithm is faster on large, repetitive strings.
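
To make the idea of a CFG that derives a single string concrete, the following is a minimal sketch in Python. It is not the ESP-based compressor described in the abstract: as a stand-in it uses greedy replacement of the most frequent adjacent pair (in the style of Re-Pair), and the function names `build_grammar`, `expand`, and `candidate_patterns` are hypothetical names chosen for this illustration. It only shows how repeated substrings end up as the expansions of nonterminals in such a grammar.

```python
from collections import Counter


def build_grammar(text):
    """Greedy pair replacement (Re-Pair style); an assumption of this sketch,
    NOT the ESP-based compressor from the abstract.

    Terminals are single characters (str); nonterminals are ints.
    Returns (start_sequence, rules) with rules[n] = (left, right).
    """
    seq = list(text)
    rules = {}
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no adjacent pair repeats any more
        name = len(rules)  # fresh nonterminal id
        rules[name] = pair
        out, i = [], 0
        while i < len(seq):
            # replace non-overlapping occurrences of the pair, left to right
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(name)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol, rules):
    """Derive the unique string generated by a terminal or nonterminal."""
    if symbol not in rules:
        return symbol  # terminal character
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


def candidate_patterns(text):
    """Map each nonterminal's expansion to the number of times it occurs
    in text (overlapping occurrences counted, by naive scanning)."""
    _, rules = build_grammar(text)
    result = {}
    for name in rules:
        pattern = expand(name, rules)
        result[pattern] = sum(
            text.startswith(pattern, i) for i in range(len(text))
        )
    return result


if __name__ == "__main__":
    sample = "abracadabra_abracadabra"
    start, rules = build_grammar(sample)
    print(start, rules)
    print(candidate_patterns(sample))
```

Because each nonterminal has exactly one rule, the grammar derives exactly one string, and every nonterminal's expansion is a substring of the input. The naive counting step here is quadratic and only for illustration; the linear-time behaviour claimed in the abstract depends on ESP, which this sketch does not implement.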