Markov model recognition and classification of DNA/protein sequences within large text databases

  • Authors:
  • Jonathan D. Wren;William H. Hildebrand;Sreedevi Chandrasekaran;Ulrich Melcher

  • Affiliations:
  • Advanced Center for Genome Technology, Stephenson Research and Technology Center, Department of Botany and Microbiology, The University of Oklahoma 101 David L. Boren Blvd. Rm 2025, Norman, OK 73 ...;Department of Microbiology and Immunology, The University of Oklahoma Health Sciences Center Oklahoma City, OK, USA;Department of Microbiology and Immunology, The University of Oklahoma Health Sciences Center Oklahoma City, OK, USA;Department of Biochemistry and Molecular Biology, Oklahoma State University Stillwater, OK, USA

  • Venue:
  • Bioinformatics
  • Year:
  • 2005

Abstract

Motivation: Short sequence patterns frequently define regions of biological interest (binding sites, immune epitopes, primers, etc.), yet a large fraction of this information exists only within the scientific literature and is thus difficult to locate via conventional means (e.g. keyword queries or manual searches). We describe herein a system to accurately identify and classify sequence patterns from within large corpora using an n-gram Markov model (MM).

Results: As expected, on test sets we found that identification of sequences with limited alphabets and/or regular structures such as nucleic acids (non-ambiguous) and peptide abbreviations (3-letter) was highly accurate, whereas classification of symbolic (1-letter) peptide strings with more complex alphabets was more problematic. The MM was used to analyze two very large, sequence-containing corpora: over 7.75 million Medline abstracts and 9000 full-text articles from Journal of Virology. Performance was benchmarked by comparing the results with Journal of Virology entries in two existing manually curated databases: VirOligo and the HLA Ligand Database. Performance estimates were 98 ± 2% precision/84% recall for primer identification and classification and 67 ± 6% precision/85% recall for peptide epitopes. We also find a dramatic difference between the amounts of sequence-related data reported in abstracts versus full text. Our results suggest that automated extraction and classification of sequence elements is a promising, low-cost means of sequence database curation and annotation.

Availability: MM routine and datasets are available upon request.

Contact: Jonathan.Wren@OU.edu
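The abstract does not give the details of the MM (n-gram order, smoothing, decision thresholds), so the following is only a minimal sketch of the general idea: train one character-level n-gram Markov model per class and assign a token to the class whose model gives it the highest length-normalized log-probability. The class name `NgramMarkovModel`, the helper `classify`, the trigram order, the add-one smoothing, and the toy training strings are all assumptions made for illustration; they are not the authors' routine or data.

```python
import math
from collections import defaultdict


class NgramMarkovModel:
    """Character-level n-gram Markov model with add-one smoothing (illustrative)."""

    def __init__(self, n=3, alphabet_size=26):
        self.n = n                          # n-gram length (assumed: trigrams)
        self.alphabet_size = alphabet_size  # smoothing denominator term (assumed)
        self.context_counts = defaultdict(int)
        self.ngram_counts = defaultdict(int)

    def _transitions(self, token):
        # Pad the start so the first characters also have a context.
        token = token.upper()
        padded = "^" * (self.n - 1) + token
        for i in range(len(token)):
            yield padded[i:i + self.n - 1], padded[i + self.n - 1]

    def train(self, tokens):
        for token in tokens:
            for context, symbol in self._transitions(token):
                self.context_counts[context] += 1
                self.ngram_counts[(context, symbol)] += 1

    def avg_log_prob(self, token):
        # Mean per-character log-probability, so long and short tokens compare fairly.
        total = 0.0
        for context, symbol in self._transitions(token):
            num = self.ngram_counts.get((context, symbol), 0) + 1
            den = self.context_counts.get(context, 0) + self.alphabet_size
            total += math.log(num / den)
        return total / max(len(token), 1)


def classify(token, models):
    """Return the label of the model that scores the token highest."""
    return max(models, key=lambda label: models[label].avg_log_prob(token))


# Toy training data (hypothetical, far too small for real use).
dna_examples = ["ATGCGTACGTTAGC", "GGATCCGAATTCAAGCTT", "TTGACAGCTAGCTCAGTCC"]
text_examples = ["sequence", "binding", "primer", "abstract",
                 "journal", "virology", "database"]

models = {"dna": NgramMarkovModel(), "text": NgramMarkovModel()}
models["dna"].train(dna_examples)
models["text"].train(text_examples)

for candidate in ["GATTACAGATTACA", "primers"]:
    print(candidate, "->", classify(candidate, models))
```

With realistic training corpora, the same scoring scheme extends to more classes (for example nucleic acid, 3-letter peptide, 1-letter peptide, ordinary text). A small-alphabet class such as non-ambiguous DNA separates easily from English words, which is consistent with the abstract's observation that 1-letter peptide strings, whose alphabet overlaps heavily with ordinary text, are harder to classify.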