Inferring an original sequence from erroneous copies: a Bayesian approach

  • Authors:
  • Jonathan M. Keith;Peter Adams;Darryn Bryant;Keith R. Mitchelson;Duncan A. E. Cochran;Gita H. Lala

  • Affiliations:
  • Department of Mathematics, The University of Queensland, St Lucia, Qld 4072, Australia;Department of Mathematics, The University of Queensland, St Lucia, Qld 4072, Australia;Department of Mathematics, The University of Queensland, St Lucia, Qld 4072, Australia;Australian Genome Research Facility, The University of Queensland, St Lucia, Qld 4072, Australia and Institute for Molecular Bioscience, The University of Queensland, St Lucia, Qld 4072, Australia;Department of Mathematics, The University of Queensland, St Lucia, Qld 4072, Australia and Australian Genome Research Facility, The University of Queensland, St Lucia, Qld 4072, Australia;Department of Mathematics, The University of Queensland, St Lucia, Qld 4072, Australia and Australian Genome Research Facility, The University of Queensland, St Lucia, Qld 4072, Australia

  • Venue:
  • APBC '03 Proceedings of the First Asia-Pacific bioinformatics conference on Bioinformatics 2003 - Volume 19
  • Year:
  • 2003

Abstract

This paper considers the problem of inferring an original sequence from a number of erroneous copies. The problem arises in DNA sequencing, particularly in the context of emerging technologies that provide high throughput or other advantages, but at the cost of introducing many errors. We develop a Bayesian probabilistic model of the introduction of errors, and search for a sequence that has maximum posterior probability with respect to the model. We present results of extensive tests in which error-prone sequencing of real DNA was simulated. The results obtained using the new approach are compared to results obtained by deriving a consensus sequence from a multiple sequence alignment. We find that a significant improvement in accuracy is obtained using the new approach. The implication is that high error levels need not be a barrier to the adoption of sequencing technologies that are in other respects promising, because most errors can be detected and corrected using a small number of reads.
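
To make the contrast between maximum-posterior inference and consensus calling concrete, here is a minimal sketch under heavily simplified assumptions. It is not the authors' model: it assumes the reads are already aligned and of equal length, that errors are substitutions only (no insertions or deletions), that the prior over bases is uniform, and that each read has its own known error rate. The names `map_sequence`, `majority_consensus`, and `error_rates` are hypothetical, introduced only for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def map_sequence(reads, error_rates):
    """Per-position maximum a posteriori sequence under a toy error model.

    Assumptions (illustrative only, not taken from the paper):
    - reads are pre-aligned strings of equal length over ACGT;
    - read i miscopies each base independently with probability
      error_rates[i] (strictly between 0 and 1), picking uniformly
      among the three other bases;
    - the prior over original bases is uniform, so the posterior is
      proportional to the likelihood.
    """
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("reads must all have the same length")

    result = []
    for pos in range(length):
        best_base, best_logp = None, -math.inf
        for base in ALPHABET:
            # Log-likelihood of the observed column if `base` were the original.
            logp = 0.0
            for read, eps in zip(reads, error_rates):
                if read[pos] == base:
                    logp += math.log(1.0 - eps)
                else:
                    logp += math.log(eps / 3.0)
            if logp > best_logp:
                best_base, best_logp = base, logp
        result.append(best_base)
    return "".join(result)


def majority_consensus(reads):
    """Baseline: most frequent base in each aligned column."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


if __name__ == "__main__":
    reads = ["ACGT", "ACTT", "ACTT"]
    error_rates = [0.01, 0.4, 0.4]  # one reliable read, two noisy ones
    print(map_sequence(reads, error_rates))
    print(majority_consensus(reads))
```

In this toy example the two noisy reads outvote the reliable one under majority consensus, while the probabilistic model weights each read by its error rate and sides with the reliable read at the disputed position. A realistic treatment would also have to model insertions and deletions and would not assume a given alignment.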