Compressed text indexing with wildcards

  • Authors:
  • Wing-Kai Hon;Tsung-Han Ku;Rahul Shah;Sharma V. Thankachan;Jeffrey Scott Vitter

  • Affiliations:
  • National Tsing Hua University, Taiwan;National Tsing Hua University, Taiwan;Louisiana State University, USA;Louisiana State University, USA;The University of Kansas, USA

  • Venue:
  • Journal of Discrete Algorithms
  • Year:
  • 2013

Abstract

Let $T = T_1 \phi^{k_1} T_2 \phi^{k_2} \cdots \phi^{k_d} T_{d+1}$ be a text of total length $n$, where characters of each $T_i$ are chosen from an alphabet $\Sigma$ of size $\sigma$, and $\phi$ denotes a wildcard symbol. The text indexing with wildcards problem is to index $T$ such that when we are given a query pattern $P$, we can locate the occurrences of $P$ in $T$ efficiently. This problem has been applied in indexing genomic sequences that contain single-nucleotide polymorphisms (SNPs), because SNPs can be modeled as wildcards. Recently, Tam et al. (2009) and Thachuk (2011) have proposed succinct indexes for this problem. In this paper, we present the first compressed index for this problem, which takes only $nH_h + o(n \log \sigma) + O(d \log n)$ bits of space, where $H_h$ is the $h$th-order empirical entropy ($h = o(\log_\sigma n)$) of $T$.
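To make the problem statement concrete, the following is a minimal sketch of the matching semantics only, written as a naive O(n·|P|) scan. It is not the compressed index described in the paper, nor the succinct indexes of Tam et al. or Thachuk. The function name `naive_wildcard_occurrences` and the use of `?` as a stand-in for the wildcard symbol are assumptions made for illustration.

```python
from typing import List

# Assumption: "?" stands in for the wildcard symbol and does not occur in the alphabet.
WILDCARD = "?"


def naive_wildcard_occurrences(text: str, pattern: str, wildcard: str = WILDCARD) -> List[int]:
    """Return all start positions i where `pattern` matches text[i:i+len(pattern)].

    A wildcard position in the text matches any single pattern character;
    every other text position must equal the pattern character exactly.
    This is a brute-force baseline, not an index.
    """
    m = len(pattern)
    hits = []
    for i in range(len(text) - m + 1):
        window = text[i:i + m]
        # Compare the window against the pattern character by character.
        if all(t == wildcard or t == p for t, p in zip(window, pattern)):
            hits.append(i)
    return hits


if __name__ == "__main__":
    # Hypothetical genomic-style text with two single-character wildcard groups.
    text = "AC?GT?AC"
    for pattern in ("CAG", "TGA", "AC"):
        print(pattern, naive_wildcard_occurrences(text, pattern))
```

An index for this problem aims to return the same set of positions without scanning the whole text for each query.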