Modern biology is experiencing a rapid increase in data volumes that challenges our analytical skills and existing cyberinfrastructure. Exponential expansion of the Protein Sequence Universe (PSU), the protein sequence space, together with the costs and complexities of manual curation creates a major bottleneck in life sciences research. Existing resources lack scalable visualization tools that are instrumental for functional annotation. Here, we describe a multi-dimensional scaling (MDS) implementation to create a 3D embedding of the PSU that allows visualizing the relationships between large numbers of proteins. To demonstrate the method, we use sequence similarity scores as a measure of proximity. An example of the prokaryotic PSU shows that the low-dimensional representation preserves important grouping features such as relative proximity of functionally similar clusters and clear structural separation between clusters with specific and general functions. The advantages of the method and its implementation include the ability to scale to large numbers of sequences, integrate different similarity measures with other functional and experimental data, and facilitate protein annotation. Transdisciplinary approaches akin to the one described in this paper are urgently needed to quickly and efficiently translate the influx of new data into tangible innovations and groundbreaking discoveries.
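To make the embedding step concrete, the following is a minimal sketch of how pairwise similarity scores could be turned into 3D coordinates. It uses classical (Torgerson) MDS with NumPy, which is a stand-in chosen for brevity and not necessarily the MDS variant or the scalable implementation described in the paper. The function name `classical_mds`, the toy similarity matrix, and the conversion `dissimilarity = 1 - similarity` are all assumptions made for illustration.

```python
import numpy as np


def classical_mds(dissimilarity, n_components=3):
    """Embed items in n_components dimensions using classical (Torgerson) MDS.

    Illustrative helper, not the paper's implementation.
    """
    d = np.asarray(dissimilarity, dtype=float)
    n = d.shape[0]
    # Double-center the squared dissimilarities to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # eigh returns eigenvalues in ascending order; pick the largest ones.
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:n_components]
    # Clip negative eigenvalues, which can appear for non-Euclidean input.
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale


# Hypothetical toy data: normalized similarity scores in [0, 1] for five
# sequences forming two groups (items 0-2 and items 3-4).
similarity = np.array([
    [1.00, 0.90, 0.80, 0.20, 0.10],
    [0.90, 1.00, 0.85, 0.15, 0.20],
    [0.80, 0.85, 1.00, 0.10, 0.15],
    [0.20, 0.15, 0.10, 1.00, 0.90],
    [0.10, 0.20, 0.15, 0.90, 1.00],
])

# Assumed conversion from similarity to dissimilarity.
dissimilarity = 1.0 - similarity

# One row of 3D coordinates per sequence, ready for a scatter plot.
coords = classical_mds(dissimilarity, n_components=3)
print(coords.shape)
```

Classical MDS requires an eigendecomposition of the full n-by-n matrix, so this sketch is practical only for small inputs. Scaling to large numbers of sequences, as the abstract describes, would call for an iterative or parallel formulation.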