Compressed indexes for text with wildcards

Authors:
Chris Thachuk
Venue:
Theoretical Computer Science
Year:
2013

Abstract

We study the problem of indexing text with wildcard positions, motivated by the challenge of aligning sequencing data to large genomes that contain millions of single nucleotide polymorphisms (SNPs), positions known to differ between individuals. SNPs modeled as wildcards can lead to more informed and biologically relevant alignments. We improve the space complexity while maintaining the query time complexity of previous approaches by giving a compressed index requiring $2nH_k(T) + o(n \log \sigma) + O(n + d \log n)$ bits for a text $T$ of length $n$ over an alphabet of size $\sigma$ containing $d$ groups of wildcards. The new index is particularly favorable for larger alphabets and comparable for smaller alphabets, such as DNA. A key to the space reduction is a result we give showing how any compressed suffix array can be supplemented with auxiliary data structures occupying $O(n) + O(d \log \frac{n}{d})$ bits to also support efficient dictionary matching queries. We discuss how the space can be reduced further by a number of approaches and by allowing an increase in the worst-case query time. We also present a new query algorithm for our wildcard indexes that can greatly reduce the query working space.
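
To make the query semantics concrete, the following is a minimal brute-force sketch of matching a pattern against a text with wildcard positions. It is not the compressed index described in the abstract: it scans every alignment in O(nm) time and uses no suffix array or dictionary matching. The function name `wildcard_matches` and the use of `*` as the wildcard symbol are assumptions made for illustration only.

```python
from typing import List


def wildcard_matches(text: str, pattern: str, wildcard: str = "*") -> List[int]:
    """Return every start position where `pattern` aligns with `text`.

    A wildcard character in `text` (hypothetically written as "*") is
    treated as matching any pattern character; all other positions must
    match exactly. This is a naive scan, not an index.
    """
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        window = text[start:start + m]
        # Accept the alignment if each text character is a wildcard
        # or equals the corresponding pattern character.
        if all(t == wildcard or t == p for t, p in zip(window, pattern)):
            hits.append(start)
    return hits


if __name__ == "__main__":
    # A toy DNA text with two wildcard (SNP-like) positions.
    print(wildcard_matches("AC*TA*GT", "CGT"))
```

In this toy text the pattern `CGT` aligns at positions 1 and 5, in each case because a wildcard absorbs one pattern character. An index of the kind the abstract describes answers the same queries without scanning the whole text.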