RCSI: scalable similarity search in thousand(s) of genomes

Authors:
Sebastian Wandelt;Johannes Starlinger;Marc Bux;Ulf Leser
Affiliations:
Humboldt-Universität zu Berlin, Wissensmanagement in der Bioinformatik, Berlin, Germany (all authors)
Venue:
Proceedings of the VLDB Endowment
Year:
2013

Abstract

Until recently, genomics concentrated on comparing sequences between species. However, due to the sharply falling cost of sequencing technology, studies of populations of individuals of the same species are now feasible and promise advances in areas such as personalized medicine and treatment of genetic diseases. A core operation in such studies is read mapping, i.e., finding all parts of a set of genomes that are within edit distance k of a given query sequence (k-approximate search). To achieve sufficient speed, current algorithms solve this problem for only one to-be-searched genome and compute only approximate solutions, i.e., they miss some k-approximate occurrences. We present RCSI, the Referentially Compressed Search Index, which scales to a thousand genomes and computes the exact answer. It exploits the fact that genomes of different individuals of the same species are highly similar by first compressing the to-be-searched genomes with respect to a reference genome. Given a query, RCSI then searches the reference and all genome-specific individual differences. We propose efficient data structures for representing compressed genomes and present algorithms for scalable compression and similarity search. We evaluate our algorithms on a set of 1092 human genomes, which amount to approx. 3 TB of raw data. RCSI compresses this set by a ratio of 450:1 (26:1 including the search index) and answers similarity queries on a mid-class server in 15 ms on average, even for comparatively large error thresholds, thereby significantly outperforming other methods. Furthermore, we present a fast and adaptive heuristic for choosing the best reference sequence for referential compression, a problem that has not been studied before at this scale.
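The two building blocks named in the abstract, k-approximate search and referential compression, can be illustrated with a small Python sketch. This is not the RCSI implementation: the function names (`k_approx_search`, `compress`, `decompress`), the `(ref_start, length, next_char)` entry layout, and the brute-force matching are assumptions chosen for readability, and the search uses the textbook dynamic-programming method (Sellers' algorithm) on a plain string rather than an index over a reference plus differences.

```python
def k_approx_search(text, query, k):
    """Return 1-based end positions j such that some substring of text
    ending at j is within edit distance k of query (Sellers' algorithm)."""
    m = len(query)
    prev = list(range(m + 1))      # column for the empty text prefix
    hits = []
    for j, c in enumerate(text, start=1):
        cur = [0] * (m + 1)        # cur[0] = 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if query[i - 1] == c else 1
            cur[i] = min(prev[i - 1] + cost,  # match or substitution
                         prev[i] + 1,         # extra character in text
                         cur[i - 1] + 1)      # query character left unmatched
        if cur[m] <= k:
            hits.append(j)
        prev = cur
    return hits


def compress(reference, target):
    """Greedy referential compression (illustrative, brute force).
    Each entry is (ref_start, length, next_char): copy `length` characters
    from the reference starting at `ref_start`, then append `next_char`."""
    entries = []
    i = 0
    while i < len(target):
        best_pos, best_len = 0, 0
        for p in range(len(reference)):
            l = 0
            # Stop one character early so a literal next_char always exists.
            while (p + l < len(reference) and i + l < len(target) - 1
                   and reference[p + l] == target[i + l]):
                l += 1
            if l > best_len:
                best_pos, best_len = p, l
        entries.append((best_pos, best_len, target[i + best_len]))
        i += best_len + 1
    return entries


def decompress(reference, entries):
    """Rebuild the target sequence from the reference and the entries."""
    return "".join(reference[p:p + l] + c for p, l, c in entries)


reference = "ACGTACGTAC"
individual = "ACGTTCGTAC"          # differs from the reference at one position
entries = compress(reference, individual)
restored = decompress(reference, entries)
exact_hits = k_approx_search(individual, "CGT", 0)
fuzzy_hits = k_approx_search(individual, "ACGTA", 1)
```

Because the individual sequence differs from the reference in a single position, two entries are enough to describe it, which is the effect referential compression relies on for genomes of the same species. A practical system would replace the quadratic scan in `compress` with an index over the reference.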