Various improvements to text fingerprinting

  • Authors:
  • Djamal Belazzougui;Roman Kolpakov;Mathieu Raffinot

  • Affiliations:
  • Department of Computer Science, University of Helsinki, FI-00014, Finland;Liapunov French-Russian Institute, Lomonosov Moscow State University, Moscow, Russia;LIAFA, Univ. Paris Diderot - Paris 7, 75205 Paris Cedex 13, France

  • Venue:
  • Journal of Discrete Algorithms
  • Year:
  • 2013

Quantified Score

Hi-index 0.00

Visualization

Abstract

Let s=s"1..s"n be a text (or sequence) on a finite alphabet @S of size @s. A fingerprint in s is the set of distinct characters appearing in one of its substrings. The problem considered here is to compute the set F of all fingerprints of all substrings of s in order to answer efficiently certain questions on this set. A substring s"i..s"j is a maximal location for a fingerprint f@?F (denoted by ) if the alphabet of s"i..s"j is f and s"i"-"1, s"j"+"1, if defined, are not in f. The set of maximal locations in s is L (it is easy to see that |L|= and such that s"i..s"j=s"k..s"l are named copies, and the quotient set of L according to the copy relation is denoted by L"C. We first present new exact efficient algorithms and data structures for the following three problems: (1) to compute F; (2) given f as a set of distinct characters in @S, to answer if f represents a fingerprint in F; (3) given f, to find all maximal locations of f in s. As well as in papers concerning succinct data structures, in the paper all space complexities are counted in bits. Problem 1 is solved either in O(n+|L"C|log@s) worst-case time (in this paper all logarithms are intended as base two logarithms) using O((n+|L"C|+|F|log@s)logn) bits of space, or in O(n+|L|log@s) randomized expected time using O((n+|F|log@s)logn) bits of space. Problem 2 is solved either in O(|f|) expected time if only O(|f|logn) bits of working space for queries is allowed, or in worst-case O(|f|/@e) time if a working space of O(@s^@elogn) bits is allowed (with @e a constant satisfying 0