Compressed q-Gram Indexing for Highly Repetitive Biological Sequences

Authors:
Francisco Claude; Antonio Fariña; Miguel A. Martínez-Prieto; Gonzalo Navarro
Venue:
BIBE '10 Proceedings of the 2010 IEEE International Conference on Bioinformatics and Bioengineering
Year:
2010

Citing: 0
Cited by: 10

Self-indexing based on LZ77
CPM'11 Proceedings of the 22nd annual conference on Combinatorial pattern matching

Space efficient wavelet tree construction
SPIRE'11 Proceedings of the 18th international conference on String processing and information retrieval

Indexes for highly repetitive document collections
Proceedings of the 20th ACM international conference on Information and knowledge management

Lightweighting the web of data through compact RDF/HDT
CAEPIA'11 Proceedings of the 14th international conference on Advances in artificial intelligence: spanish association for artificial intelligence

Iterative Dictionary Construction for Compression of Large DNA Data Sets
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)

Self-Indexed Grammar-Based Compression
Fundamenta Informaticae

DACs: Bringing direct access to variable-length codes
Information Processing and Management: an International Journal

Compressed suffix trees for repetitive texts
SPIRE'12 Proceedings of the 19th international conference on String Processing and Information Retrieval

Improved grammar-based compressed indexes
SPIRE'12 Proceedings of the 19th international conference on String Processing and Information Retrieval

On compressing and indexing repetitive sequences
Theoretical Computer Science

Abstract

The study of compressed storage schemes for highly repetitive sequence collections has recently been boosted by the availability of cheaper sequencing technologies and the flood of data they promise to generate. The goals of such a storage scheme may range from the simple one of retrieving whole individual sequences to the more advanced one of providing fast searches in the collection. In this paper we study alternatives for implementing a particularly popular index, namely, one able to find all the positions in the collection of substrings of fixed length ($q$-grams). We introduce two novel techniques and show that they constitute practical alternatives for handling this scenario. They excel particularly in two cases: when $q$ is small (up to 6), and when the collection is extremely repetitive (less than 0.01% mutations).
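
To make the indexed operation concrete, the following is a minimal sketch of a plain, uncompressed $q$-gram index: a table from each $q$-gram to the positions where it starts in the collection. It is not either of the two techniques the paper introduces, only the baseline functionality they aim to provide in less space. The names `build_qgram_index` and `locate`, and the toy two-sequence collection, are assumptions made for illustration.

```python
from collections import defaultdict


def build_qgram_index(sequences, q):
    """Map each q-gram to the list of (sequence id, offset) pairs where it starts.

    Hypothetical helper: an uncompressed baseline, not the paper's method.
    """
    index = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        # Slide a window of length q over the sequence; sequences shorter
        # than q contribute nothing because the range is empty.
        for offset in range(len(seq) - q + 1):
            index[seq[offset:offset + q]].append((seq_id, offset))
    return dict(index)


def locate(index, qgram):
    """Return every occurrence of qgram, or an empty list if it never occurs."""
    return index.get(qgram, [])


# Toy collection (assumed data): two near-identical sequences, one mutation.
collection = ["ACGTACGT", "ACGTACGA"]
index = build_qgram_index(collection, 4)
print(locate(index, "ACGT"))
```

In a highly repetitive collection most occurrence lists of this table repeat nearly the same offsets for each sequence, which is the redundancy a compressed index can exploit.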