High performance computing workflow for protein functional annotation

Authors:
Larissa Stanberry;Yuan Liu;Bhanu Rekepalli;Paul Giblock;Roger Higdon;William Broomall
Affiliations:
Seattle Children's Research Institute (SCRI), DELSA Global;JICS UT - ORNL;University of Tennessee, DELSA Global;JICS UT - ORNL;Bioinformatics & High-Throughput Analysis Laboratory and High-throughput Analysis Core, SCRI DELSA Global;High-Throughput Analysis Core, SCRI/DELSA Global
Venue:
Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery
Year:
2013

Abstract

Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the Protein Sequence Universe (PSU) expands exponentially: newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas high compute cost limits the utility of existing automated approaches. In this study, we built an automated workflow that uses High Performance Computing (HPC) architectures to enable large-scale annotation of proteins into existing orthologous groups. We developed a low-complexity classification algorithm to assign proteins to bacterial Clusters of Orthologous Groups of proteins (COGs). The algorithm is based on the Position-Specific Iterative Basic Local Alignment Search Tool (PSI-BLAST) and was validated on simulated and archaeal data to ensure at least 80% specificity and sensitivity. The workflow, which includes highly scalable parallel applications for classification and sequence alignment, was developed on Extreme Science and Engineering Discovery Environment (XSEDE) supercomputers. Using the workflow, we have classified one million newly sequenced bacterial proteins. With the rapid expansion of the PSU, the proposed workflow will enable scientists to annotate big genome data.
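
The abstract reports a validation target of at least 80% specificity and sensitivity. As a minimal illustration of what those two quantities mean for a single orthologous group, the sketch below computes them from reference and predicted assignments. It is not the authors' validation procedure: the function name, the one-group-versus-rest treatment, and the toy labels are all assumptions made for this example.

```python
import math
from typing import Optional, Sequence, Tuple


def sensitivity_specificity(
    truth: Sequence[Optional[str]],
    predicted: Sequence[Optional[str]],
    cog: str,
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for one group, treated as one-vs-rest.

    truth[i] is the reference group of protein i (None if it belongs to no group);
    predicted[i] is the group the classifier assigned (None if unassigned).
    This labeling scheme is a hypothetical convention for the example.
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")

    tp = fn = fp = tn = 0
    for t, p in zip(truth, predicted):
        is_member = t == cog   # protein truly belongs to the group
        is_called = p == cog   # classifier assigned it to the group
        if is_member and is_called:
            tp += 1
        elif is_member:
            fn += 1
        elif is_called:
            fp += 1
        else:
            tn += 1

    # Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    sens = tp / (tp + fn) if (tp + fn) else math.nan
    spec = tn / (tn + fp) if (tn + fp) else math.nan
    return sens, spec


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    truth = ["COG0001", "COG0001", "COG0002", None, "COG0001"]
    predicted = ["COG0001", "COG0002", "COG0002", "COG0001", "COG0001"]
    sens, spec = sensitivity_specificity(truth, predicted, "COG0001")
    print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

Under this convention, a group would meet the stated target when both returned values are at least 0.8; how the study aggregated results across groups is not described in the abstract.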