Character sets of strings

  • Authors:
  • Gilles Didier; Thomas Schmidt; Jens Stoye; Dekel Tsur

  • Affiliations:
  • Centro de Modelamiento Matematico CNRS UMR 2071 Santiago de Chile, Chile; International NRW Graduate School in Bioinformatics and Genome Research, Center of Biotechnology, Universität Bielefeld, 33594 Bielefeld, Germany; Technische Fakultät, Universität Bielefeld, 33594 Bielefeld, Germany; Computer Science Department, Ben-Gurion University, Israel

  • Venue:
  • Journal of Discrete Algorithms
  • Year:
  • 2007

Abstract

Given a string S over a finite alphabet Σ, the character set (also called the fingerprint) of a substring S′ of S is the subset C ⊆ Σ of the symbols occurring in S′. The study of the character sets of all the substrings of a given string (or a given collection of strings) appears in several domains, such as rule induction for natural language processing or comparative genomics. Several computational problems concerning the character sets of a string arise from these applications, especially: (1) Output all the maximal locations of substrings having a given character set. (2) Output, for each character set C occurring in a given string (or a given collection of strings), all the maximal locations of C. Denoting by n the total length of the considered string or collection of strings, we solve the first problem in Θ(n) time using Θ(n) space. We present two algorithms solving the second problem. The first one runs in Θ(n^2) time using Θ(n) space. The second algorithm has Θ(n|Σ| log |Σ|) time and Θ(n) space complexity and is an adaptation of an algorithm by Amir et al. [A. Amir, A. Apostolico, G.M. Landau, G. Satta, Efficient text fingerprinting via Parikh mapping, J. Discrete Algorithms 26 (2003) 1-13].
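
To make the two problems concrete, the following Python sketch may help. It is not the algorithms of the paper. It assumes that a maximal location of a character set C is an interval [i, j] such that the symbols of S[i..j] are exactly C and the interval cannot be extended to the left or right without leaving C; the abstract does not spell this definition out. The function names maximal_locations and all_maximal_locations are hypothetical. The first function handles problem (1) with a single left-to-right scan, assuming constant-time set operations. The second is a plain brute force for problem (2), meant only as a reference for small inputs; it does not achieve the bounds stated above.

```python
from collections import defaultdict


def maximal_locations(s, charset):
    """Problem (1), sketch: return all maximal locations (start, end),
    0-based and inclusive, of substrings of s whose character set is
    exactly `charset`. Assumed definition: the interval cannot be
    extended on either side without leaving `charset`."""
    target = set(charset)
    result = []
    i, n = 0, len(s)
    while i < n:
        if s[i] not in target:
            i += 1
            continue
        # Extend a maximal run of symbols that all belong to the target set.
        j = i
        seen = set()
        while j < n and s[j] in target:
            seen.add(s[j])
            j += 1
        # The run is a maximal location only if it uses every target symbol.
        if seen == target:
            result.append((i, j - 1))
        i = j
    return result


def all_maximal_locations(s):
    """Problem (2), brute-force reference (not the paper's algorithms):
    map every character set occurring in s to its maximal locations."""
    n = len(s)
    table = defaultdict(list)
    for i in range(n):
        seen = set()
        for j in range(i, n):
            seen.add(s[j])
            # Maximal on the left: no symbol before i, or that symbol is new.
            left_ok = i == 0 or s[i - 1] not in seen
            # Maximal on the right: no symbol after j, or that symbol is new.
            right_ok = j == n - 1 or s[j + 1] not in seen
            if left_ok and right_ok:
                table[frozenset(seen)].append((i, j))
    return dict(table)


if __name__ == "__main__":
    print(maximal_locations("abcab", "ab"))
    for fingerprint, locations in all_maximal_locations("abcab").items():
        print(sorted(fingerprint), locations)
```

On the string "abcab" with C = {a, b}, the scan reports the two intervals (0, 1) and (3, 4), and the brute force lists the same intervals under that fingerprint. The brute force builds a frozenset for every reported interval, so its cost grows with the alphabet size as well as with n.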