Reference-based alignment in large sequence databases

Authors:
Panagiotis Papapetrou;Vassilis Athitsos;George Kollios;Dimitrios Gunopulos
Affiliations:
Boston University;University of Texas at Arlington;Boston University;University of Athens and UC Riverside
Venue:
Proceedings of the VLDB Endowment
Year:
2009



Abstract

This paper introduces a novel method, called Reference-Based String Alignment (RBSA), that speeds up retrieval of optimal subsequence matches in large databases of sequences under the edit distance and the Smith-Waterman similarity measure. RBSA assumes that the optimal match deviates from the query by a relatively small amount, one that does not exceed a prespecified fraction of the query length. RBSA has an exact version that guarantees no false dismissals and can handle large queries efficiently. An approximate version of RBSA is also described; it achieves significant further speedups over the exact version, with negligible loss in retrieval accuracy. RBSA filters candidate matches using precomputed alignment scores between the database sequence and a set of fixed-length reference sequences. At query time, the query sequence is partitioned into segments whose length equals that of the reference sequences. For each segment, the alignment scores between the segment and the reference sequences are used to efficiently identify a relatively small number of candidate subsequence matches. An alphabet collapsing technique is employed to improve the pruning power of the filter step. In our experimental evaluation, RBSA significantly outperforms state-of-the-art biological sequence alignment methods, such as q-grams, BLAST, and BWT.
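
The filter step described above can be illustrated with a small Python sketch. It is not the authors' implementation: the abstract does not specify the index layout, how reference sequences are chosen, or how matches of varying length are handled. The sketch assumes plain edit distance, fixed-length database windows only, and a triangle-inequality lower bound as the pruning rule. All function names (`edit_distance`, `build_index`, `candidate_windows`, `search`) are hypothetical.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def build_index(database: str, references: List[str], seg_len: int) -> List[List[int]]:
    """Offline step (assumed layout): distance from every window of length
    seg_len in the database to every reference sequence."""
    return [
        [edit_distance(database[i:i + seg_len], r) for r in references]
        for i in range(len(database) - seg_len + 1)
    ]


def candidate_windows(segment: str, references: List[str],
                      index: List[List[int]], max_dist: int) -> List[int]:
    """Keep windows whose lower bound on the distance to `segment` is small.

    Edit distance is a metric, so for any reference r:
        d(segment, window) >= |d(segment, r) - d(window, r)|.
    A window whose bound exceeds max_dist cannot match and is pruned.
    """
    seg_to_ref = [edit_distance(segment, r) for r in references]
    candidates = []
    for pos, win_to_ref in enumerate(index):
        lower_bound = max(abs(s - w) for s, w in zip(seg_to_ref, win_to_ref))
        if lower_bound <= max_dist:
            candidates.append(pos)
    return candidates


def search(database: str, query: str, references: List[str],
           seg_len: int, max_dist: int) -> List[Tuple[int, int]]:
    """Split the query into segments of length seg_len, filter, then verify.

    Returns (query_offset, database_position) pairs whose fixed-length
    window lies within max_dist of the query segment.
    """
    index = build_index(database, references, seg_len)
    hits = []
    for start in range(0, len(query) - seg_len + 1, seg_len):
        segment = query[start:start + seg_len]
        for pos in candidate_windows(segment, references, index, max_dist):
            # Verification: compute the true distance only for survivors.
            if edit_distance(segment, database[pos:pos + seg_len]) <= max_dist:
                hits.append((start, pos))
    return hits


if __name__ == "__main__":
    db = "ACGTACGTTTGACCA"
    refs = ["ACGT", "TTTT"]  # arbitrary illustrative references
    print(search(db, "TTGA", refs, seg_len=4, max_dist=1))
```

Because the bound never overestimates the true distance, this filter discards no window that would pass verification, which mirrors the "no false dismissals" property claimed for the exact version. The savings come from replacing most full alignments at query time with cheap comparisons of precomputed integers.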