Fast algorithms for finding a minimum repetition representation of strings and trees

  • Authors:
  • Atsuyoshi Nakamura;Tomoya Saito;Ichigaku Takigawa;Mineichi Kudo;Hiroshi Mamitsuka

  • Affiliations:
  • Hokkaido University, Kita 14, Nishi 9, Kita-ku, Sapporo 060-0814, Japan;Hokkaido University, Kita 14, Nishi 9, Kita-ku, Sapporo 060-0814, Japan;Hokkaido University, Kita 14, Nishi 9, Kita-ku, Sapporo 060-0814, Japan;Hokkaido University, Kita 14, Nishi 9, Kita-ku, Sapporo 060-0814, Japan;Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan

  • Venue:
  • Discrete Applied Mathematics
  • Year:
  • 2013

Quantified Score

Hi-index 0.04

Visualization

Abstract

A string with many repetitions can be represented compactly by replacing h-fold contiguous repetitions of a string r with (r)^h. We present a compact representation, which we call a repetition representation (of a string) or RRS, by which a set of disjoint or nested tandem arrays can be compacted. In this paper, we study the problem of finding a minimum RRS or MRRS, where the size of an RRS is defined by the sum of the length of component letters and the description length of the component repetitions (@?)^h which is defined by w"R(h) using a repetition weight function w"R. We develop two dynamic programming-based algorithms to solve this problem: CMR, which works for any type of w"R, and CMR-C, which is faster but can be applied to a constant w"R only. CMR-C is an O(n^2logn)-time O(nlogn)-space algorithm, which is more efficient in both time and space than CMR by a ((logn)/n)-factor, where n is the length of the given string. The problem of finding an MRRS for a string can be extended to that of finding a minimum repetition representation (of a tree) or MRRT for a given labeled ordered tree. For this problem, we present two algorithms, CMRT and CMRT-C, by using CMR and CMR-C, respectively, as a subroutine. As well as the theoretical analysis, we confirmed the efficiency of the proposed algorithms by experiments, which consist of the following three parts: First we demonstrated that CMR-C and CMRT-C are fast enough for large-scale data by using synthetic strings and trees, respectively. The size of an MRRS for a given string can be a measure of how compactly the string can be represented, meaning how well the string is structurally organized. This is also true of trees. To check such ability of MRRS-size, second we measured the size of an MRRS for chromosomes of nine different species. We found that all the chromosomes of the same species have a similar compression rate when realized by an MRRS. Run length encoding (RLE) was also shown to have species-specific compression rate, but species were separated more clearly by MRRS than by RLE. Third we examined the size of an MRRT for web pages of world-leading companies by using the tag trees, showing a consistency between the compression rate by an MRRT and visual web page structures.