Seed-set construction by equi-entropy partitioning for efficient and sensitive short-read mapping

Authors:
Kouichi Kimura; Asako Koike; Kenta Nakai
Affiliations:
Central Research Laboratory, Hitachi Ltd., Tokyo, Japan; Central Research Laboratory, Hitachi Ltd., Tokyo, Japan; The Institute of Medical Science, The University of Tokyo, Tokyo, Japan
Venue:
WABI'11 Proceedings of the 11th international conference on Algorithms in bioinformatics
Year:
2011

Abstract

Spaced seeds have been shown to be superior to continuous seeds for efficient and sensitive homology search based on the seed-and-extend paradigm, and much the same holds for genome mapping of high-throughput short-read data. However, a highly sensitive search with multiple spaced patterns often requires a large amount of index data. We propose a novel seed-set construction method for efficient and sensitive genome mapping of short reads with relatively high error rates. The method uses only continuous seeds of variable length, each allowing a few errors. The seed lengths and allowable error positions are optimized on the basis of entropy, a measure of the ambiguity or repetitiveness of mapping positions. These seeds can be searched efficiently using the Burrows-Wheeler transform of the reference genome. An evaluation on actual biological SOLiD sequence data showed that our method was competitive with spaced-seed methods in speed and sensitivity while using much less memory and disk space.
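
To make the last two ideas concrete, the following is a minimal illustrative sketch and not the authors' implementation. It counts exact occurrences of a continuous seed with a backward search over the Burrows-Wheeler transform of a toy reference, then reports the seed's ambiguity in bits. The abstract does not give the paper's entropy definition, so the sketch assumes the simplest reading: a seed that occurs n times in the reference has log2(n) bits of positional ambiguity. The function names (`build_index`, `count_occurrences`, `ambiguity_bits`) are hypothetical, the index construction is naive and only suitable for small inputs, and seeds with errors are not handled.

```python
import math
from collections import Counter


def build_index(reference):
    """Build a naive BWT index (C table and Occ table) for a small reference."""
    text = reference + "$"  # '$' is a sentinel that sorts before A, C, G, T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT character of each sorted suffix is the character preceding it
    bwt = "".join(text[i - 1] for i in suffix_array)

    # c_table[ch]: number of characters in text strictly smaller than ch
    counts = Counter(text)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]

    # occ[ch][i]: number of occurrences of ch in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (1 if ch == b else 0)
    return c_table, occ, len(bwt)


def count_occurrences(seed, index):
    """Count exact occurrences of seed by backward search over the BWT."""
    c_table, occ, n = index
    lo, hi = 0, n  # half-open range of suffix-array rows matching the suffix so far
    for ch in reversed(seed):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


def ambiguity_bits(seed, index):
    """Assumed ambiguity measure: log2 of the occurrence count, None if absent."""
    n = count_occurrences(seed, index)
    return math.log2(n) if n > 0 else None


if __name__ == "__main__":
    index = build_index("ACGTACGTTACG")
    for seed in ("ACG", "ACGT", "ACGTT"):
        print(seed, count_occurrences(seed, index), ambiguity_bits(seed, index))
```

Under this assumed measure, extending a seed can only lower or preserve its ambiguity, which is one way to see why variable seed lengths are useful: a seed in a repetitive region needs more bases than a seed in a unique region to reach the same ambiguity.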