One of the key open problems in large-scale DNA sequence assembly is the correct reconstruction of sequences that contain repeats. A long repeat can confound a sequence assembler into falsely overlaying fragments that sample its copies, effectively compressing out the repeat in the reconstructed sequence. We call the task of correcting this compression, by separating the overlaid fragments into the distinct copies they sample, the repeat separation problem. We present a rigorous formulation of repeat separation in the general setting, without prior knowledge of the consensus sequences of repeats or their number of copies. Our formulation decomposes the task into a series of four subproblems, and we design probabilistic tests or combinatorial algorithms that solve each subproblem. The core subproblem separates repeats using the so-called k-median problem in combinatorial optimization, which we solve using integer linear programming. Experiments with an implementation show that we can separate fragments that are overlaid at 10 times the coverage with very few mistakes in a few seconds of computation, even when the sequencing error rate and the error rate between copies are identical. To our knowledge, this is the first rigorous and fully general approach to separating repeats that directly addresses the problem.
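To make the k-median idea concrete, the following is a minimal sketch and not the method described above: it solves a tiny k-median instance exactly by enumerating candidate median sets, rather than by integer linear programming. It also assumes, purely for illustration, that fragments are equal-length aligned strings compared by Hamming distance and that the number of copies k is known. The names `hamming` and `k_median` and the toy fragments are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def k_median(fragments, k):
    """Exact k-median by enumeration (illustrative only, exponential in k).

    Chooses k fragments as medians so that the total distance from every
    fragment to its nearest chosen median is as small as possible.
    Returns (total cost, median indices, median index assigned to each fragment).
    """
    n = len(fragments)
    dist = [[hamming(a, b) for b in fragments] for a in fragments]
    best_cost, best_medians = None, None
    for medians in combinations(range(n), k):
        cost = sum(min(dist[m][j] for m in medians) for j in range(n))
        if best_cost is None or cost < best_cost:
            best_cost, best_medians = cost, medians
    # Assign each fragment to its nearest chosen median.
    assignment = [min(best_medians, key=lambda m: dist[m][j]) for j in range(n)]
    return best_cost, best_medians, assignment


# Toy data (assumed): six overlaid fragments drawn from two repeat copies
# that differ at positions 0 and 4, plus two single-base sequencing errors.
fragments = ["ACGTACGT", "ACGTACGA", "ACGTACGT",
             "TCGTTCGT", "TCGTTCGT", "TCGTTCGA"]
cost, medians, assignment = k_median(fragments, 2)
print(cost, medians, assignment)  # prints: 2 (0, 3) [0, 0, 0, 3, 3, 3]
```

In this toy instance the two chosen medians act as representatives of the two copies, and the assignment splits the overlaid fragments into the copies they sample. Enumeration is only practical for very small inputs, which is one reason an integer linear programming formulation is the more natural tool at realistic sizes.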