Compressed property suffix trees

  • Authors:
  • Wing-Kai Hon;Manish Patil;Rahul Shah;Sharma V. Thankachan

  • Affiliations:
  • Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan;Department of Computer Science, Louisiana State University, Baton Rouge, LA, USA;Department of Computer Science, Louisiana State University, Baton Rouge, LA, USA;Department of Computer Science, Louisiana State University, Baton Rouge, LA, USA

  • Venue:
  • Information and Computation
  • Year:
  • 2013

Abstract

Property matching is a biologically motivated problem in which the task is to find those occurrences of an online pattern $P$ in a text string $T$ (of size $n$) such that the matched part of the text satisfies some conceptual property. The property of a string is a set $\pi$ of (possibly overlapping) intervals $\{(s_1,f_1),(s_2,f_2),\ldots\}$ over the text, and an occurrence $P=T[i,\ldots,(i+|P|-1)]$ of the pattern is a valid output only if $T[i,\ldots,(i+|P|-1)]$ is completely contained in at least one interval $(s_j,f_j)\in\pi$. The indexing version of this problem was introduced by A. Amir (2008), where the text is preprocessed in $O(n\log\sigma+n\log\log n)$ time and an index of $O(n\log n)$ bits, named the Property Suffix Tree (PST), is maintained. The PST can perform property matching in $O(|P|\log\sigma+occ_\pi)$ time, where $occ_\pi$ is the number of occurrences of $P$ in $T$ satisfying the property. T. Kopelowitz (2010) considered the dynamic version of this problem, where intervals can be added or deleted. However, all these indexes take space linear in the length of the text ($O(n\log n)$ bits), which can be much more than the size of the text itself ($n\log\sigma$ bits). In this paper, we propose the first index for property matching that occupies space close to the entropy-compressed space requirement of the text. Our compressed index takes $|CSA|+n(2+\epsilon+o(1))$ bits of space and answers queries in $O(t(|P|)+\frac{1}{\epsilon}(1+occ_\pi)\,t_{SA})$ time, where $|CSA|$ is the size of the compressed suffix array of $T$, $t(|P|)$ is the time for searching a pattern of length $|P|$ in the CSA, $t_{SA}$ is the time for computing a suffix array value, and $\epsilon>0$ is a constant. We also introduce a dynamic index, which takes $|CSA|+O(n+|\pi|\log n)$ bits of space, answers queries in $O(t(|P|)+(1+occ_\pi)\log n\,(t_{SA}+\log n/\log\log n))$ time, and can update (insert or delete) an interval $(s,f)$ in $O((f-s)(\log n+t_{SA}))$ time.
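
To make the problem definition concrete, the following is a minimal Python sketch of property matching by direct scanning. It is a naive baseline and not the compressed index proposed in the paper. The function names `longest_cover` and `property_match` are hypothetical, and the sketch assumes 0-indexed, inclusive intervals (s, f) with s <= f < n. It precomputes, for every text position i, the furthest right endpoint among the intervals covering i, so that an occurrence starting at i is valid exactly when its last character does not pass that endpoint.

```python
def longest_cover(n, intervals):
    """Return end[i] = largest f over intervals (s, f) with s <= i <= f, or -1.

    Assumption: intervals are 0-indexed and inclusive, with s <= f < n.
    """
    end = [-1] * n
    for s, f in intervals:
        end[s] = max(end[s], f)
    for i in range(1, n):
        # An interval covering i-1 that reaches position i also covers i.
        if end[i - 1] >= i:
            end[i] = max(end[i], end[i - 1])
    return end


def property_match(text, pattern, intervals):
    """Naive baseline: list starts of occurrences lying inside a single interval."""
    m = len(pattern)
    if m == 0:
        return []
    end = longest_cover(len(text), intervals)
    hits = []
    i = text.find(pattern)
    while i != -1:
        # The occurrence text[i .. i+m-1] must fit inside one interval.
        if i + m - 1 <= end[i]:
            hits.append(i)
        i = text.find(pattern, i + 1)
    return hits
```

For example, with the text "abracadabra" and the pattern "abra" (occurrences at positions 0 and 7), the property {(0, 4), (6, 9)} admits only the occurrence at 0, because the occurrence at 7 ends at position 10 and no interval covering position 7 extends that far. Two overlapping intervals such as (0, 2) and (1, 3) do not jointly validate the occurrence at 0, since the definition requires containment in a single interval.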