An efficient hybrid approach to correcting errors in short reads

Authors:
Zhiheng Zhao;Jianping Yin;Yong Li;Wei Xiong;Yubin Zhan
Affiliations:
School of Computer, National University of Defense Technology, Changsha, China (all authors)
Venue:
MDAI'11 Proceedings of the 8th international conference on Modeling decisions for artificial intelligence
Year:
2011


Abstract

High-throughput sequencing technologies produce a large number of short reads that may contain errors. These sequencing errors are one of the major problems in analyzing such data. Many algorithms and software tools have been proposed to correct errors in short reads, but their computational complexity limits their performance. In this paper, we propose a novel and efficient hybrid approach that combines an alignment-free method with multiple alignments. We construct suffix arrays over all short reads to search for correct overlapping regions. For each correct overlapping region, we form multiple alignments of the substrings that follow it in order to identify and correct the erroneous bases. Our approach can correct all types of errors in short reads produced by different sequencing platforms. Experiments show that our approach achieves significantly higher accuracy than previous approaches while running at comparable or even higher speed.
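The two steps named in the abstract (a suffix array over all reads, then a comparison of what follows each shared region) can be illustrated with a minimal sketch. This is not the authors' implementation. It rests on several assumptions that do not come from the paper: the function names `build_suffix_array` and `correct_reads`, the seed length `k`, and the threshold `min_support` are hypothetical; the suffix array is built by naive sorting; and the "multiple alignment" is reduced to a majority vote over the single base that follows a shared seed, so only substitution errors are handled. Insertions and deletions, which the abstract says the real method covers, would need a genuine alignment step.

```python
from collections import Counter
from itertools import groupby


def build_suffix_array(reads, k):
    """Sorted (read_index, offset) pairs for every suffix longer than k.

    Naive construction by sorting the suffix strings; illustration only.
    """
    entries = [(i, j) for i, r in enumerate(reads) for j in range(len(r) - k)]
    entries.sort(key=lambda e: reads[e[0]][e[1]:])
    return entries


def correct_reads(reads, k=4, min_support=2):
    """Correct substitution errors by voting on the base after each shared seed.

    k and min_support are hypothetical parameters chosen for this sketch.
    """
    reads = list(reads)
    sa = build_suffix_array(reads, k)
    fixes = {}  # (read_index, position) -> consensus base

    # Suffixes that share the same k-base prefix are contiguous in the
    # sorted order, so groupby recovers each candidate overlapping region.
    for _seed, group in groupby(sa, key=lambda e: reads[e[0]][e[1]:e[1] + k]):
        group = list(group)
        votes = Counter(reads[i][j + k] for i, j in group)
        strong = [base for base, n in votes.items() if n >= min_support]
        # Act only when exactly one base is well supported; two strong
        # bases may reflect a repeat or a true variant, so leave them alone.
        if len(strong) != 1:
            continue
        consensus = strong[0]
        for i, j in group:
            if reads[i][j + k] != consensus:
                fixes[(i, j + k)] = consensus

    corrected = [list(r) for r in reads]
    for (i, pos), base in fixes.items():
        corrected[i][pos] = base
    return ["".join(r) for r in corrected]


if __name__ == "__main__":
    sample = ["ACGTTGCA", "ACGTTGCA", "ACGTTGCA", "ACGTTGGA"]
    print(correct_reads(sample, k=4, min_support=2))
```

In this sketch a base is changed only when it disagrees with the single well-supported base that follows the same seed. Because the vote looks only forward from a seed, errors within the first `k` bases of a read are never corrected; handling them would require a second pass, for example over the reverse complements of the reads.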