Separating repeats in DNA sequence assembly

Authors:
John Kececioglu; Jun Ju
Affiliations:
Department of Computer Science, The University of Arizona, Tucson, AZ; Department of Computer Science, The University of Georgia, Athens, GA
Venue:
RECOMB '01 Proceedings of the fifth annual international conference on Computational biology
Year:
2001

Abstract

One of the key open problems in large-scale DNA sequence assembly is the correct reconstruction of sequences that contain repeats. A long repeat can confound a sequence assembler into falsely overlaying fragments that sample its copies, effectively compressing out the repeat in the reconstructed sequence. We call the task of correcting this compression, by separating the overlaid fragments into the distinct copies they sample, the repeat separation problem. We present a rigorous formulation of repeat separation in the general setting, without prior knowledge of the consensus sequences of repeats or their number of copies. Our formulation decomposes the task into a series of four subproblems, and we design probabilistic tests or combinatorial algorithms that solve each subproblem. The core subproblem separates repeats using the k-median problem from combinatorial optimization, which we solve using integer linear programming. Experiments with an implementation show that we can separate fragments that are overlaid at 10 times the coverage with very few mistakes in a few seconds of computation, even when the sequencing error rate and the error rate between copies are identical. To our knowledge this is the first rigorous and fully general approach to separating repeats that directly addresses the problem.
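To make the k-median idea in the abstract concrete, here is a minimal sketch in Python. It is not the authors' method: the abstract says the k-median problem is solved with integer linear programming, whereas this sketch simply enumerates every choice of k medians, which is only feasible for tiny inputs. The toy fragments, the use of Hamming distance, and the names `hamming` and `k_median` are all assumptions made for illustration. Each string stands for one overlaid fragment restricted to a few columns of the alignment.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def k_median(items, k, dist):
    """Brute-force k-median (illustrative only, exponential in k).

    Chooses k of the items as medians so that the sum, over all items,
    of the distance to the nearest chosen median is minimized.
    Returns (total cost, median indices, nearest-median index per item).
    """
    n = len(items)
    best_cost, best_medians = None, None
    for medians in combinations(range(n), k):
        cost = sum(min(dist(items[i], items[m]) for m in medians)
                   for i in range(n))
        if best_cost is None or cost < best_cost:
            best_cost, best_medians = cost, medians
    assignment = [min(best_medians, key=lambda m: dist(items[i], items[m]))
                  for i in range(n)]
    return best_cost, best_medians, assignment


# Hypothetical toy data: two repeat copies, "AACC" and "GTCC", that differ
# in the first two columns; two fragments carry one sequencing error each.
fragments = ["AACC", "AACC", "AACG", "GTCC", "GTCC", "GTTC"]

cost, medians, assignment = k_median(fragments, 2, hamming)
print(cost, medians, assignment)
```

On this toy input the first three fragments are assigned to one median and the last three to the other, so the two overlaid copies are separated despite the two simulated sequencing errors. An exact solver for realistic inputs would replace the enumeration with an optimization formulation such as the integer linear program the abstract mentions.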