High performance computing workflow for protein functional annotation

  • Authors:
  • Larissa Stanberry;Yuan Liu;Bhanu Rekepalli;Paul Giblock;Roger Higdon;William Broomall

  • Affiliations:
  • Seattle Children's Research Institute (SCRI), DELSA Global;JICS UT - ORNL;University of Tennessee, DELSA Global;JICS UT - ORNL;Bioinformatics & High-Throughput Analysis Laboratory and High-throughput Analysis Core, SCRI DELSA Global;High-Throughput Analysis Core, SCRI/DELSA Global

  • Venue:
  • Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery
  • Year:
  • 2013

Abstract

Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the PSU (Protein Sequence Universe) expands exponentially. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas high compute cost limits the utility of existing automated approaches. In this study, we built an automated workflow to enable large-scale annotation of proteins into existing orthologous groups using HPC (High Performance Computing) architectures. We developed a low-complexity classification algorithm to assign proteins to bacterial COGs (Clusters of Orthologous Groups of proteins). The algorithm is based on PSI-BLAST (Position-Specific Iterative Basic Local Alignment Search Tool) and was validated on simulated and archaeal data to ensure at least 80% specificity and sensitivity. The workflow, with highly scalable parallel applications for classification and sequence alignment, was developed on XSEDE (Extreme Science and Engineering Discovery Environment) supercomputers. Using the workflow, we have classified one million newly sequenced bacterial proteins. With the rapid expansion of the PSU, the proposed workflow will enable scientists to annotate big genome data.
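
The validation criterion mentioned in the abstract (at least 80% specificity and sensitivity) can be illustrated with a minimal sketch. The abstract does not say how the two measures were computed for a multi-group classifier, so the definitions below are assumptions: a protein that truly belongs to a COG counts as a true positive only when it is assigned to that same COG, and a protein that belongs to no COG counts as a true negative only when it is left unassigned. The function names, the use of None for "no COG", and the sample labels are hypothetical and are not taken from the paper.

```python
from typing import Iterable, Optional, Tuple

Label = Optional[str]  # a COG identifier, or None for "belongs to no COG" (assumed encoding)


def sensitivity_specificity(pairs: Iterable[Tuple[Label, Label]]) -> Tuple[float, float]:
    """Compute (sensitivity, specificity) from (true_label, predicted_label) pairs.

    Assumed definitions, not taken from the paper:
      sensitivity = correctly assigned COG members / all COG members
      specificity = non-members left unassigned   / all non-members
    """
    tp = fn = tn = fp = 0
    for truth, predicted in pairs:
        if truth is not None:
            # Protein has a known COG: only an exact match counts as a hit.
            if predicted == truth:
                tp += 1
            else:
                fn += 1
        else:
            # Protein has no COG: any assignment is a false positive.
            if predicted is None:
                tn += 1
            else:
                fp += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


def meets_threshold(sensitivity: float, specificity: float, threshold: float = 0.8) -> bool:
    """Check both measures against a common threshold (0.8 mirrors the 80% in the abstract)."""
    return sensitivity >= threshold and specificity >= threshold


if __name__ == "__main__":
    # Hypothetical labelled examples, for illustration only.
    sample = [
        ("COG0001", "COG0001"),
        ("COG0001", "COG0002"),
        ("COG0002", "COG0002"),
        ("COG0002", None),
        (None, None),
        (None, "COG0001"),
        (None, None),
        (None, None),
    ]
    sens, spec = sensitivity_specificity(sample)
    print(f"sensitivity={sens:.2f} specificity={spec:.2f} passes={meets_threshold(sens, spec)}")
```

Counting a wrong-COG assignment as a false negative is the stricter of the plausible conventions; a per-COG, one-versus-rest calculation would give different numbers and may be closer to what the authors used.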