Gene-finding via tandem mass spectrometry

  • Authors:
  • Ting Chen

  • Affiliations:
  • Department of Mathematics, University of Southern California, Los Angeles, CA

  • Venue:
  • RECOMB '01 Proceedings of the fifth annual international conference on Computational biology
  • Year:
  • 2001

Abstract

We propose a new gene-finding methodology that combines high-performance liquid chromatography (HPLC)-tandem mass spectrometry experiments with a fast computer algorithm to locate coding regions and introns. Proteins are first extracted from cells and digested by enzymes, and the resulting peptides are then separated and analyzed by HPLC-tandem mass spectrometry. We designed an algorithm to find DNA coding sequences, corresponding to open reading frames (ORFs), in the genome such that their translated amino acid sequences are optimally correlated with these tandem mass spectra. The algorithm also allows one gap, corresponding to an intron, between two DNA coding sequences, so that their concatenation forms one coding sequence. Finally, the algorithm assembles these candidate coding sequences and introns into gene structures. We implemented the algorithm and used it to predict genes on 4 contigs totaling 123 kbp, using two sets of simulated digestion-HPLC-tandem mass spectrometry data for 2523 Caenorhabditis elegans Chromosome IV proteins, digested by trypsin and Asp-N respectively. Among the 15 annotated genes on the forward strand, all 98 exons are hit by the predicted no-gap coding sequences, and 60 of the 83 introns are correctly predicted. We also tested gene structure prediction on a contig containing 3 genes. Combining splice-site predictions with the predicted coding sequences and introns, we found all 3 gene structures.
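
As an illustration of the simulated digestion step mentioned in the abstract, the following Python sketch cleaves a protein sequence in silico with trypsin and with Asp-N. It is not the paper's implementation, which the abstract does not describe. The function names (`cleave`, `trypsin_rule`, `asp_n_rule`) and the toy sequence are hypothetical, and the cleavage rules are the commonly used textbook conventions: trypsin cuts after K or R unless the next residue is P, and Asp-N cuts on the N-terminal side of D. Missed cleavages and peptide masses are deliberately left out.

```python
def cleave(protein, cut_before):
    """Split `protein` at every index i where cut_before(protein, i) is true.

    A cut at index i separates residue i-1 from residue i.
    """
    if not protein:
        return []
    peptides, start = [], 0
    for i in range(1, len(protein)):
        if cut_before(protein, i):
            peptides.append(protein[start:i])
            start = i
    peptides.append(protein[start:])  # trailing peptide after the last cut
    return peptides


def trypsin_rule(protein, i):
    # Assumed convention: cut after K or R, but not when followed by P.
    return protein[i - 1] in "KR" and protein[i] != "P"


def asp_n_rule(protein, i):
    # Assumed convention: cut immediately before D.
    return protein[i] == "D"


# Toy sequence, invented for illustration only.
toy = "MKRPAKDE"
tryptic_peptides = cleave(toy, trypsin_rule)
asp_n_peptides = cleave(toy, asp_n_rule)
print(tryptic_peptides)  # ['MK', 'RPAK', 'DE']
print(asp_n_peptides)    # ['MKRPAK', 'DE']
```

Digesting the same protein with two enzymes yields overlapping peptide sets, which is one plausible reason the experiment in the abstract uses both trypsin and Asp-N data.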