Compressed indexes for aligned pattern matching

  • Authors: Sharma V. Thankachan

  • Affiliations: Department of CS, Louisiana State University

  • Venue: SPIRE'11 Proceedings of the 18th International Conference on String Processing and Information Retrieval
  • Year: 2011

Abstract

In many settings, such as protein analysis, the primary protein sequence is associated with secondary structure labels [6]; the two can be treated as a pair of sequences aligned character by character. Many DNA and RNA sequences likewise involve linkages that are aligned within the same strand or across different strands. In this paper, we consider the most natural characterization of aligned string data. The aligned pattern matching problem is to index two input texts $T_1[1 \ldots n]$ and $T_2[1 \ldots n]$, each having $n$ characters drawn from an alphabet $\Sigma$ of size $\sigma = \mathrm{polylog}(n)$, such that the following query can be answered efficiently: given two query patterns $P_1$ and $P_2$, find all text positions $i$ such that $P_1$ matches $T_1[i \ldots (i+|P_1|-1)]$ and $P_2$ matches $T_2[i \ldots (i+|P_2|-1)]$. Our objective is to design a compressed-space index for this problem, and we obtain the following main results. When the query patterns are sufficiently long ($|P_1|, |P_2| \geq \alpha = \Theta(\log^{2+2\epsilon} n)$, where $\epsilon > 0$), we can design an index that takes $nH'_k + nH''_k + o(n \log \sigma)$ bits of space and answers queries in $O(|P_1| + |P_2| + \log^{4+4\epsilon} n + t)$ time, where $H'_k$ and $H''_k$ denote the empirical $k$th-order entropy ($k = o(\log_\sigma n)$) of $T_1$ and $T_2$ respectively, and $t$ is the number of outputs. Further, we show that designing a compressed/succinct index with polylogarithmic query time that works for query patterns of all lengths is at least as hard as designing a linear-space index for 3-dimensional orthogonal range reporting with polylogarithmic query time. However, we introduce another compressed index that requires $nH'_k + nH''_k + O(n) + o(n \log \sigma)$ bits of space and has query time $O(|P_1| + |P_2| + \sqrt{nt}\,\log^{2+\epsilon} n)$, which works without any restriction on the length of the patterns.
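
To make the query in the problem definition concrete, the following is a minimal sketch of a brute-force reference for it. It is not the compressed index described in the abstract: it simply scans both texts in O(n(|P1| + |P2|)) time and uses no index at all. The function name `aligned_matches` and the sample protein and secondary-structure strings are illustrative assumptions, not taken from the paper.

```python
def aligned_matches(t1, t2, p1, p2):
    """Naive reference for the aligned pattern matching query.

    Returns every 1-based position i such that p1 occurs in t1 starting
    at i and p2 occurs in t2 starting at the same i. This is a plain scan,
    not the compressed index from the paper.
    """
    if len(t1) != len(t2):
        raise ValueError("T1 and T2 must have the same length n")
    # str.startswith(pattern, start) tests for an occurrence beginning at
    # 0-based offset `start`; it is False when the pattern would run past
    # the end of the text.
    return [
        i + 1  # report 1-based positions, as in the problem statement
        for i in range(len(t1))
        if t1.startswith(p1, i) and t2.startswith(p2, i)
    ]


# Hypothetical example: a primary sequence aligned with structure labels.
primary = "MKVLAAGMKVL"
labels = "CHHHHHCCCHE"
print(aligned_matches(primary, labels, "KVL", "HH"))  # prints [2]
```

In this example "KVL" occurs in the primary sequence at positions 2 and 9, but only position 2 is reported, because the label string reads "HH" at position 2 and "CH" at position 9.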