Compressed indexes for aligned pattern matching

Authors:
Sharma V. Thankachan
Affiliations:
Department of CS, Louisiana State University
Venue:
SPIRE'11 Proceedings of the 18th international conference on String processing and information retrieval
Year:
2011

References:

Lower bounds for orthogonal range searching: I. The reporting case. Journal of the ACM (JACM)
Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing
Succinct Representation of Balanced Parentheses and Static Trees. SIAM Journal on Computing
High-order entropy-compressed text indexes. SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms
The LCA Problem Revisited. LATIN '00 Proceedings of the 4th Latin American Symposium on Theoretical Informatics
New data structures for orthogonal range searching. FOCS '00 Proceedings of the 41st Annual Symposium on Foundations of Computer Science
Indexing compressed text. Journal of the ACM (JACM)
Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. SIAM Journal on Computing
Compressed full-text indexes. ACM Computing Surveys (CSUR)
Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms (TALG)
Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Transactions on Algorithms (TALG)
Compressed Suffix Trees with Full Functionality. Theory of Computing Systems
The SBC-tree: an index for run-length compressed sequences. EDBT '08 Proceedings of the 11th international conference on Extending database technology: Advances in database technology
Geometric Burrows-Wheeler Transform: Linking Range Searching and Text Indexing. DCC '08 Proceedings of the Data Compression Conference
Linear pattern matching algorithms. SWAT '73 Proceedings of the 14th Annual Symposium on Switching and Automata Theory (swat 1973)
Space-Efficient Framework for Top-k String Retrieval Problems. FOCS '09 Proceedings of the 2009 50th Annual IEEE Symposium on Foundations of Computer Science
String retrieval for multi-pattern queries. SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval

Abstract

In many settings, such as protein data, the primary sequence is associated with secondary structure labels [6]. Such data can be treated as two sequences aligned character by character. Many DNA and RNA sequences likewise involve linkages that are aligned within the same strand or across different strands. In this paper, we consider the most natural characterization of aligned string data. The aligned pattern matching problem is to index two input texts, $T_1[1 \ldots n]$ and $T_2[1 \ldots n]$, each having $n$ characters taken from an alphabet $\Sigma$ of size $\sigma = \mathrm{polylog}(n)$, such that the following query can be answered efficiently: given two query patterns $P_1$ and $P_2$, find all text positions $i$ such that $P_1$ matches $T_1[i \ldots (i+|P_1|-1)]$ and $P_2$ matches $T_2[i \ldots (i+|P_2|-1)]$. Our objective is to design a compressed-space index for this problem, and we obtain the following main results. When the query patterns are sufficiently long ($|P_1|, |P_2| \geq \alpha = \Theta(\log^{2+2\epsilon} n)$, where $\epsilon > 0$), we can design an index that takes $nH'_k + nH''_k + o(n \log \sigma)$ bits of space and $O(|P_1| + |P_2| + \log^{4+4\epsilon} n + t)$ query time, where $H'_k$ and $H''_k$ denote the empirical $k$th-order entropy ($k = o(\log_\sigma n)$) of $T_1$ and $T_2$ respectively, and $t$ is the number of outputs. Further, we show that designing a compressed/succinct space index with polylogarithmic query time that works for query patterns of all lengths is at least as hard as designing a linear-space index for 3-dimensional orthogonal range reporting with polylogarithmic query time. However, we introduce another compressed index that requires $nH'_k + nH''_k + O(n) + o(n \log \sigma)$ bits of space and has a query time of $O(|P_1| + |P_2| + \sqrt{nt} \log^{2+\epsilon} n)$, and that works without any restriction on the pattern lengths.
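
To make the query definition concrete, here is a minimal brute-force sketch in Python. It is not the compressed index described in the abstract; it only spells out what an aligned match is, scanning both texts in O(n(|P1| + |P2|)) time with no preprocessing. The function name `aligned_matches` and the sample strings (a made-up residue sequence with made-up helix/coil labels) are assumptions for illustration only.

```python
def aligned_matches(t1: str, t2: str, p1: str, p2: str) -> list[int]:
    """Naive reference for the aligned pattern matching query.

    Returns every 1-based position i such that p1 occurs in t1 starting
    at i AND p2 occurs in t2 starting at the same position i.
    This is a baseline for checking answers, not an index.
    """
    if len(t1) != len(t2):
        raise ValueError("T1 and T2 must have the same length n")

    hits = []
    for i in range(len(t1)):
        # str.startswith(prefix, start) tests for an occurrence beginning
        # at offset i; it is False if the pattern would run past the end.
        if t1.startswith(p1, i) and t2.startswith(p2, i):
            hits.append(i + 1)  # report 1-based positions, as in the abstract
    return hits


# Hypothetical toy data: a residue string and per-residue structure labels.
T1 = "MKVLAAGKV"
T2 = "CCHHHHCCH"

both = aligned_matches(T1, T2, "KV", "CH")
none = aligned_matches(T1, T2, "KV", "CC")
```

Here "KV" occurs in `T1` at positions 2 and 8, and "CH" occurs in `T2` at the same two positions, so `both` holds both positions. "CC" occurs in `T2` only at positions 1 and 7, so `none` is empty even though each pattern occurs in its own text.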