Visualizing the protein sequence universe

  • Authors:
  • Larissa Stanberry;Roger Higdon;Winston Haynes;Natali Kolker;William Broomall;Saliya Ekanayake;Adam Hughes;Yang Ruan;Judy Qiu;Eugene Kolker;Geoffrey Fox

  • Affiliations:
  • Seattle Children's Research Institute, Seattle, USA;Seattle Children's Research Institute, Seattle, USA;Seattle Children's Research Institute, Seattle, USA;Seattle Children's Research Institute, Seattle, USA;Seattle Children's Research Institute, Seattle, USA;Indiana University, Boomington, USA;Indiana University, Bloomington, USA;Indiana University, Bloomington, USA;Indiana University, Bloomington, USA;Seattle Children's Research Institute, Seattle, USA;Indiana University, Bloomington, USA

  • Venue:
  • Proceedings of the 3rd international workshop on Emerging computational methods for the life sciences
  • Year:
  • 2012

Visualization

Abstract

Modern biology is experiencing a rapid increase in data volumes that challenges our analytical skills and existing cyberinfrastructure. Exponential expansion of the Protein Sequence Universe (PSU), the protein sequence space, together with the costs and complexities of manual curation creates a major bottleneck in life sciences research. Existing resources lack scalable visualization tools that are instrumental for functional annotation. Here, we describe a multi-dimensional scaling (MDS) implementation to create a 3D embedding of the PSU that allows visualizing the relationships between large numbers of proteins. To demonstrate the method, we use sequence similarity scores as a measure of proximity. An example of the prokaryotic PSU shows that the low-dimensional representation preserves important grouping features such as relative proximity of functionally similar clusters and clear structural separation between clusters with specific and general functions. The advantages of the method and its implementation include the ability to scale to large numbers of sequences, integrate different similarity measures with other functional and experimental data, and facilitate protein annotation. Transdisciplinary approaches akin to the one described in this paper are urgently needed to quickly and efficiently translate the influx of new data into tangible innovations and groundbreaking discoveries.
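
As a rough illustration of the idea in the abstract, the sketch below embeds a handful of sequences in 3D from pairwise similarity scores. It is not the authors' implementation: it uses classical (Torgerson) MDS in NumPy, which is only one of several MDS variants and does not address the scalability the paper targets. The function names, the conversion from similarity to distance (`similarity_to_distance`), and the similarity values themselves are all assumptions made up for this example.

```python
import numpy as np


def similarity_to_distance(sim):
    """Hypothetical conversion: normalize scores by self-similarity, use 1 - s.

    Assumes every self-similarity (diagonal entry) is positive.
    """
    sim = np.asarray(sim, dtype=float)
    sim = (sim + sim.T) / 2.0  # enforce symmetry
    self_scores = np.diag(sim)
    normalized = sim / np.sqrt(np.outer(self_scores, self_scores))
    dist = 1.0 - normalized
    np.fill_diagonal(dist, 0.0)
    return dist


def classical_mds(dist, n_components=3):
    """Classical (Torgerson) MDS on a symmetric distance matrix."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double-center the squared distances to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # Keep the largest eigenvalues; clip negatives that arise when the
    # input distances are not exactly Euclidean.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Made-up similarity scores for five sequences forming two groups
# (sequences 0-2 and sequences 3-4). Not real alignment data.
sim = np.array([
    [100, 80, 75, 10, 12],
    [80, 100, 78, 11, 9],
    [75, 78, 100, 8, 10],
    [10, 11, 8, 100, 85],
    [12, 9, 10, 85, 100],
])

coords = classical_mds(similarity_to_distance(sim), n_components=3)
within = np.linalg.norm(coords[0] - coords[1])   # same group
between = np.linalg.norm(coords[0] - coords[3])  # different groups
```

Each row of `coords` is a 3D point that can be passed to any scatter-plot tool. In this toy case the two groups should land far apart relative to their internal spread, which is the kind of grouping structure the abstract describes for the prokaryotic PSU.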