A high performance algorithm for clustering of large-scale protein mass spectrometry data using multi-core architectures

  • Authors:
  • Fahad Saeed;Jason D. Hoffert;Mark A. Knepper

  • Affiliations:
  • National Institutes of Health (NIH), Bethesda, MD;National Institutes of Health (NIH), Bethesda, MD;National Institutes of Health (NIH), Bethesda, MD

  • Venue:
  • Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
  • Year:
  • 2013

Abstract

High-throughput mass spectrometers can produce thousands of peptide spectra from a single complex protein sample in a short amount of time. These data sets contain a substantial amount of redundancy, because the same peptide may be selected and identified multiple times in a single liquid chromatography tandem mass spectrometry (LC-MS/MS) experiment. They also contain a substantial number of spectra that have a low signal-to-noise (S/N) ratio and may not be interpreted because of their poor quality. Recently, we presented a graph-theoretic algorithm, CAMS (Clustering Algorithm for Mass Spectra), for clustering mass spectrometry data. CAMS uses a novel metric, called an F-set, that identifies similar spectra with much higher accuracy and sensitivity than single-peak comparisons. In this paper we present a multithreaded algorithm, called P-CAMS, for clustering mass spectral data on multicore machines. The algorithm relies on intelligent matrix completion for graph construction and on a load-balancing scheme to achieve substantial speedups. We study the scalability of the proposed parallel algorithm on a multicore machine using synthetically generated spectra whose parameters were carefully chosen to mimic real-world mass spectrometry datasets. Real experimental datasets were also generated to assess the quality of the clustering results. The results show that the proposed algorithm has scalable runtime performance and gives clustering results similar to those of a serial algorithm. The study also provides insight into the design of high-performance algorithms for irregular problems in proteomics on many-core architectures.
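The abstract names three ingredients (pairwise graph construction, load balancing across threads, and graph-based clustering) without giving their details. The following is a minimal sketch of how such a pipeline could be organized, not the P-CAMS implementation. Everything in it is an assumption made for illustration: the `shared_peaks` similarity is a simple placeholder and is not the F-set metric, the greedy row assignment in `balanced_row_blocks` stands in for the paper's load-balancing scheme, and clusters are taken to be connected components of the similarity graph. Only the upper triangle of the pairwise matrix is computed, since the similarity graph is undirected.

```python
from concurrent.futures import ThreadPoolExecutor


def shared_peaks(a, b, tol=0.5):
    """Placeholder similarity (NOT the F-set metric): number of peaks in
    spectrum a that have a peak in spectrum b within tol m/z units."""
    return sum(1 for x in a if any(abs(x - y) <= tol for y in b))


def balanced_row_blocks(n, workers):
    """Hypothetical load balancer. Row i of the upper triangle holds
    n - 1 - i comparisons, so equal-sized contiguous row blocks would be
    uneven. Rows are visited in order of decreasing cost and each one is
    given to the currently least-loaded worker."""
    blocks = [[] for _ in range(workers)]
    loads = [0] * workers
    for i in range(n - 1):
        w = loads.index(min(loads))
        blocks[w].append(i)
        loads[w] += n - 1 - i
    return blocks, loads


def edges_for_rows(spectra, rows, min_shared):
    """Compare each assigned row i against all j > i and keep similar pairs."""
    n = len(spectra)
    return [(i, j) for i in rows for j in range(i + 1, n)
            if shared_peaks(spectra[i], spectra[j]) >= min_shared]


def cluster(spectra, workers=4, min_shared=3):
    """Build the similarity graph in parallel, then return its connected
    components as sorted lists of spectrum indices."""
    n = len(spectra)
    blocks, _ = balanced_row_blocks(n, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(
            lambda rows: edges_for_rows(spectra, rows, min_shared), blocks)
        edges = [e for part in parts for e in part]

    # Union-find with path halving to merge connected spectra.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy peak lists (m/z values only); indices 0-1 and 2-3 are near-duplicates.
spectra = [
    [100.0, 200.0, 300.0, 400.0],
    [100.1, 200.2, 300.1, 500.0],
    [150.0, 250.0, 350.0],
    [150.2, 250.1, 350.3, 900.0],
    [700.0],
]
print(cluster(spectra, workers=2))
```

In CPython, threads running pure-Python code are serialized by the global interpreter lock, so this sketch shows only how the work is divided and would not by itself reproduce the speedups reported for a multicore implementation.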