Fast algorithms for finding a minimum repetition representation of strings and trees

Authors:
Atsuyoshi Nakamura;Tomoya Saito;Ichigaku Takigawa;Mineichi Kudo;Hiroshi Mamitsuka
Affiliations:
Hokkaido University, Kita 14, Nishi 9, Kita-ku, Sapporo 060-0814, Japan;Hokkaido University, Kita 14, Nishi 9, Kita-ku, Sapporo 060-0814, Japan;Hokkaido University, Kita 14, Nishi 9, Kita-ku, Sapporo 060-0814, Japan;Hokkaido University, Kita 14, Nishi 9, Kita-ku, Sapporo 060-0814, Japan;Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan
Venue:
Discrete Applied Mathematics
Year:
2013


Abstract

A string with many repetitions can be represented compactly by replacing each h-fold contiguous repetition of a substring r with (r)^h. We present such a compact representation, which we call a repetition representation of a string (RRS), by which a set of disjoint or nested tandem arrays can be compacted. In this paper, we study the problem of finding a minimum RRS (MRRS), where the size of an RRS is defined as the sum of the number of component letters and the description lengths of the component repetitions (·)^h, each description length being w_R(h) for a repetition weight function w_R. We develop two dynamic programming-based algorithms for this problem: CMR, which works for any w_R, and CMR-C, which is faster but applies only to a constant w_R. CMR-C is an O(n^2 log n)-time, O(n log n)-space algorithm, which is more efficient than CMR in both time and space by a factor of (log n)/n, where n is the length of the given string. The problem of finding an MRRS for a string can be extended to that of finding a minimum repetition representation of a tree (MRRT) for a given labeled ordered tree. For this problem, we present two algorithms, CMRT and CMRT-C, which use CMR and CMR-C, respectively, as a subroutine. In addition to the theoretical analysis, we confirmed the efficiency of the proposed algorithms by experiments consisting of three parts. First, using synthetic strings and trees, we demonstrated that CMR-C and CMRT-C are fast enough for large-scale data. The size of an MRRS for a given string can serve as a measure of how compactly the string can be represented, that is, of how well the string is structurally organized; the same holds for trees. Second, to check whether MRRS size has this ability, we measured the size of an MRRS for chromosomes of nine different species. We found that all chromosomes of the same species have a similar compression rate under an MRRS. Run-length encoding (RLE) also showed a species-specific compression rate, but species were separated more clearly by MRRS than by RLE. Third, we examined the size of an MRRT for the web pages of world-leading companies by using their tag trees, and found the MRRT compression rate to be consistent with the visual structure of the pages.
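To make the size definition concrete, the following is a minimal sketch of a naive interval dynamic program that computes the size of an MRRS for a short string. It is not the authors' CMR or CMR-C, whose details are not given in the abstract. The function name `mrrs_size`, the parameter `w_r`, and the default constant weight of 1 are assumptions made for illustration only.

```python
def mrrs_size(s, w_r=lambda h: 1):
    """Return the minimum RRS size of s by a naive O(n^3)-style interval DP.

    Assumptions (not from the paper): each letter costs 1, and each
    repetition (r)^h costs size(r) + w_r(h), with w_r defaulting to 1.
    """
    n = len(s)
    # cost[i][j] holds the minimum RRS size of the substring s[i:j].
    cost = [[0] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            if length == 1:
                cost[i][j] = 1  # a single letter
                continue
            # Option 1: concatenate the best representations of two parts.
            best = min(cost[i][k] + cost[k][j] for k in range(i + 1, j))
            # Option 2: write s[i:j] as (r)^h, where r = s[i:i+p], h = length // p.
            for p in range(1, length // 2 + 1):
                if length % p == 0 and s[i:i + p] * (length // p) == s[i:j]:
                    best = min(best, cost[i][i + p] + w_r(length // p))
            cost[i][j] = best
    return cost[0][n]


if __name__ == "__main__":
    # "aaabaaab" can be written as ((a)^3 b)^2: two letters plus two repetitions.
    print(mrrs_size("aaabaaab"))  # prints 4 with the default constant weight
```

With the default weight, the nested form ((a)^3 b)^2 of "aaabaaab" has size 4, whereas the non-nested form (aaab)^2 has size 5, which shows why nested tandem arrays matter. Passing a different `w_r`, for example `lambda h: len(str(h))`, changes only the cost charged per repetition.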