Algorithms for Constructing a Consensus Sequence

  • Authors:
  • Paul Cull;Jim L. Holloway

  • Affiliations:
  • -;-

  • Venue:
  • Algorithms for Constructing a Consensus Sequence
  • Year:
  • 1991

Quantified Score

Hi-index 0.00

Visualization

Abstract

Biological and physical limitations require that DNA be sequenced in fragments. There are several approaches to obtain the appropriate sized fragments of DNA to sequence. The method of sequencing that we are interested in is loosely referred to as shotgun sequencing. Many copies of the genomic DNA to be sequenced are cleaved by one or more restriction endonucleases resulting in a multiset, S, of DNA fragments that are not ordered. DNA fragments are essentially selected at random from this multiset and sequenced. A consensus sequence is constructed by joining together fragments which overlap. (One hopes that the consensus sequence is very close to the original sequence.) Since errors occur reading the sequences, the overlaps must be approximate, not exact. This process of reassembly is similar to the NP-complete shortest common superstring problem [GMS80]. To simplify the problem we make the following assumptions. - An integer k can be supplied that defines the minimum acceptable overlap between two sequences. - There is a unique alignment of the sequence fragments such that all suffix/prefix overlaps are of length k or greater. - All suffix/prefix overlaps are exact (log inexact) matches. We define the string consensus problem and give three algorithms to solve it. We then define the log inexact string consensus problem and give three algorithms to solve it. We believe that the log inexact string consensus problem is closer to the problem of constructing a consensus sequence from shotgun data that biochemists are trying to solve than the problems previous approximation algorithms for the shortest common superstring problem.