Improving phosphopeptide identification in shotgun proteomics by supervised filtering of peptide-spectrum matches

  • Authors:
  • Sujun Li;Randy J. Arnold;Haixu Tang;Predrag Radivojac

  • Affiliations:
  • Department of Computer Science and Informatics, Indiana University, Bloomington, Indiana, U.S.A.;Department of Chemistry, Indiana University, Bloomington, Indiana, U.S.A.;Department of Computer Science and Informatics, Indiana University, Bloomington, Indiana, U.S.A.;Department of Computer Science and Informatics, Indiana University, Bloomington, Indiana, U.S.A.

  • Venue:
  • Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics
  • Year:
  • 2013

Abstract

One of the important objectives in mass spectrometry-based proteomics is the identification of post-translationally modified sites in cellular and extracellular proteomes. Proteomics techniques have been particularly effective in studying protein phosphorylation, where tens of thousands of new sites have recently been discovered in all domains of life. Such massive discovery of new sites has been facilitated by progress in affinity enrichment techniques, high-throughput analytical platforms that couple liquid chromatography (LC) and tandem mass spectrometry (MS/MS), and powerful computational tools that assign peptides to tandem mass spectra. In this work we focus on computational protocols for identifying phosphoproteins, phosphopeptides, and phosphosites. Although the current tools already provide solid results, most methods have not been tuned to exploit particular sequence and physicochemical properties of phosphopeptides or the peculiarities of their fragment spectra. Therefore, novel algorithms can be designed to increase the sensitivity of phosphosite identification. Here we describe a machine learning-based method that improves the identification of phosphopeptides in LC-MS/MS experiments. Our algorithm is applied as a post-processing step to a standard database search. It assigns a probability score to each peptide-spectrum match (PSM) corresponding to a phosphopeptide, based on the sequence and spectral features of the peptide and its assigned fragment spectra as well as the biological propensity of particular residues in the peptide to be phosphorylated. The algorithm is based on a simple but robust logistic regression model and is used together with a conventional search engine (here, MASCOT) to filter out the PSMs with the lowest probability of being correctly identified. Tested on two large phosphoproteomics data sets, our protocol increased the number of identified phosphopeptides by 10-15% over conventional scoring algorithms at the same false discovery rate threshold of 1%.
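
The post-processing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The three features (a normalized search-engine score, the fraction of matched fragment ions, and a phosphorylation propensity), the toy training values, and the function names are all hypothetical. The sketch also assumes that the false discovery rate is estimated with a target-decoy strategy, which the abstract does not specify.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression weights (bias first) by batch gradient descent."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi  # derivative of the log-loss w.r.t. the linear term
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def psm_probability(w, x):
    """Probability that a PSM with feature vector x is correct."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def filter_at_fdr(psms, max_fdr=0.01):
    """Keep the largest score-ranked set of PSMs whose estimated FDR is
    at most max_fdr, and return its target PSMs.

    psms is a list of (probability, is_decoy) pairs. The FDR is estimated
    as decoys / targets among the accepted PSMs (assumed target-decoy setup).
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best_cut = 0
    for i, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            best_cut = i
    return [p for p in ranked[:best_cut] if not p[1]]


# Hypothetical toy features per PSM:
# [normalized search score, matched-ion fraction, phosphorylation propensity]
X_train = [[0.9, 0.8, 0.7], [0.8, 0.9, 0.6], [0.2, 0.1, 0.3], [0.1, 0.3, 0.2]]
y_train = [1, 1, 0, 0]  # 1 = correct identification, 0 = incorrect
weights = fit_logistic(X_train, y_train)

# Hypothetical scored PSMs: (feature vector, is_decoy)
candidates = [([0.85, 0.9, 0.8], False), ([0.3, 0.2, 0.1], True)]
scored = [(psm_probability(weights, x), d) for x, d in candidates]
accepted = filter_at_fdr(scored, max_fdr=0.01)
```

In this sketch the model only re-ranks PSMs that a search engine has already produced. The gain reported in the abstract would correspond to more target PSMs passing the same 1% threshold after re-ranking than with the search-engine score alone.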