Compressed indexing and local alignment of DNA

Authors:
T. W. Lam;W. K. Sung;S. L. Tam;C. K. Wong;S. M. Yiu
Venue:
Bioinformatics
Year:
2008

Abstract

Motivation: Recent experimental studies on compressed indexes (BWT, CSA, FM-index) have confirmed their practicality for indexing very long strings, such as the human genome, in main memory. For example, a BWT index for the human genome (about 3 billion characters) occupies only around 1 GB. However, these indexes are designed for exact pattern matching, which is too stringent for biological applications. The demand is often for finding local alignments (pairs of similar substrings, with gaps allowed). Without indexing, one can use dynamic programming to find all the local alignments between a text T and a pattern P in O(|T||P|) time, but this is too slow when the text is of genome scale (e.g. aligning a gene with the human genome would take tens to hundreds of hours). In practice, biologists use heuristic-based software such as BLAST, which is very efficient but is not guaranteed to find all local alignments.

Results: In this article, we show how to build a software tool called BWT-SW that exploits a BWT index of a text T to speed up the dynamic programming for finding all local alignments. Experiments reveal that BWT-SW is very efficient (e.g. aligning a pattern of length 3000 with the human genome takes less than a minute). We have also analyzed BWT-SW mathematically for a simpler similarity model (with gaps disallowed), and we show that the expected running time is O(|T|^0.628 |P|) for random strings. As far as we know, BWT-SW is the first practical tool that can find all local alignments. Yet BWT-SW is not meant to be a replacement for BLAST: BLAST is still several times faster than BWT-SW for long patterns, and BLAST is accurate enough in most cases (we have used BWT-SW to check the accuracy of BLAST and found that BLAST only rarely misses significant alignments).

Availability: www.cs.hku.hk/~ckwong3/bwtsw

Contact: twlam@cs.hku.hk
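For orientation, the unindexed O(|T||P|) dynamic-programming baseline that the abstract refers to can be sketched as a textbook Smith-Waterman recurrence. This is not BWT-SW itself and uses no BWT index. The function name `smith_waterman`, the linear gap penalty, and the scoring values (match +1, mismatch -3, gap -5) are assumptions chosen for illustration. The sketch also reports only the single best-scoring alignment end point, whereas the tool described above finds all local alignments.

```python
def smith_waterman(text, pattern, match=1, mismatch=-3, gap=-5):
    """Textbook local-alignment DP in O(|T||P|) time.

    Returns (best_score, end_in_text, end_in_pattern), with 1-based end
    positions, or (0, 0, 0) if no positive-scoring alignment exists.
    Scoring values are illustrative assumptions, not taken from the paper.
    """
    # prev[j] holds the best score of an alignment ending at text[i-1], pattern[j-1]
    prev = [0] * (len(pattern) + 1)
    best = (0, 0, 0)
    for i, t in enumerate(text, start=1):
        curr = [0] * (len(pattern) + 1)
        for j, p in enumerate(pattern, start=1):
            diag = prev[j - 1] + (match if t == p else mismatch)  # align t with p
            up = prev[j] + gap        # gap in the pattern
            left = curr[j - 1] + gap  # gap in the text
            # Local alignment: a score never drops below 0 (start afresh here)
            curr[j] = max(0, diag, up, left)
            if curr[j] > best[0]:
                best = (curr[j], i, j)
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman("ACGTACGT", "GTAC"))
```

Every cell of the |T| x |P| table is filled, which is why this approach becomes impractical when T is a whole genome; the abstract's contribution is to avoid most of that work by driving the dynamic programming from a BWT index of T.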