MPSCAN: fast localisation of multiple reads in genomes

Authors:
Eric Rivals; Leena Salmela; Petteri Kiiskinen; Petri Kalsi; Jorma Tarhio
Affiliations:
LIRMM, CNRS and Université de Montpellier 2, Montpellier, France; Helsinki University of Technology, TKK, Finland; Helsinki University of Technology, TKK, Finland; Helsinki University of Technology, TKK, Finland; Helsinki University of Technology, TKK, Finland
Venue:
WABI'09 Proceedings of the 9th international conference on Algorithms in bioinformatics
Year:
2009

Abstract

With Next Generation Sequencers, sequence-based transcriptomic or epigenomic assays yield millions of short sequence reads that need to be mapped back to a reference genome. The upcoming versions of these sequencers promise even higher sequencing capacities; this may turn the read mapping task into a bottleneck for which alternative pattern matching approaches must be explored. We present an algorithm and its implementation, called MPSCAN, which uses a sophisticated filtration scheme to match a set of patterns/reads exactly on a sequence. MPSCAN can search for millions of reads in a single pass through the genome without indexing its sequence. Moreover, we show that MPSCAN offers an optimal average time complexity, which is sublinear in the text length, meaning that it does not need to examine all sequence positions. Comparisons with BLAT-like tools and with six specialised read mapping programs (like BOWTIE or ZOOM) demonstrate that MPSCAN is also the fastest algorithm in practice for exact matching. Our accuracy and scalability comparisons reveal that some tools are inappropriate for read mapping. Moreover, we provide evidence suggesting that exact matching may be a valuable solution in some read mapping applications. As most read mapping programs rely in some way on exact matching procedures to perform approximate pattern mapping, the filtration scheme we tested may prove useful in the design of future algorithms. The absence of a genome index gives MPSCAN a low memory requirement and a flexibility that let it run on a desktop computer and avoid time-consuming genome preprocessing.
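
The abstract does not spell out the filtration scheme, so the following is only a minimal sketch of the general idea of matching many equal-length reads in one pass without indexing the genome. It swaps in a simple Wu-Manber-style q-gram shift table, which is not claimed to be the scheme MPSCAN implements. The names `build_shift_table` and `find_reads`, the parameter `q`, and the assumption that all reads have the same length are illustrative choices, not taken from the paper.

```python
def build_shift_table(reads, q):
    """Map each q-gram to the smallest distance between its end and a read's end."""
    m = len(reads[0])
    shift = {}
    for read in reads:
        for i in range(m - q + 1):
            gram = read[i:i + q]
            dist = m - q - i  # 0 when the q-gram is a suffix of the read
            if dist < shift.get(gram, m):
                shift[gram] = dist
    return shift


def find_reads(text, reads, q=4):
    """Report (position, read) for every exact occurrence of a read in text.

    Assumption: all reads share one length m, with 1 <= q <= m.
    """
    reads = list(reads)
    if not reads:
        return []
    m = len(reads[0])
    if any(len(r) != m for r in reads):
        raise ValueError("all reads must have the same length")
    if not 1 <= q <= m:
        raise ValueError("q must satisfy 1 <= q <= read length")

    read_set = set(reads)            # used only to verify candidate windows
    shift = build_shift_table(reads, q)
    default_shift = m - q + 1        # q-gram occurs in no read: skip past it

    hits = []
    pos = 0                          # start of the current window in text
    while pos + m <= len(text):
        gram = text[pos + m - q:pos + m]   # last q-gram of the window
        step = shift.get(gram, default_shift)
        if step == 0:
            # The q-gram ends some read, so the window is a candidate: verify.
            window = text[pos:pos + m]
            if window in read_set:
                hits.append((pos, window))
            step = 1
        pos += step
    return hits


if __name__ == "__main__":
    genome = "ACGTACGTTTGACCA"
    print(find_reads(genome, ["ACGT", "TTGA"], q=2))
    # [(0, 'ACGT'), (4, 'ACGT'), (8, 'TTGA')]
```

Whenever the last q-gram of a window occurs in no read, the window start jumps by `m - q + 1` positions at once. This illustrates how a filter can avoid examining every text position, the property the abstract describes as sublinear average time.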