Clustering and Visualization of Large Protein Sequence Databases by Means of an Extension of the Self-Organizing Map

Authors:
Panu Somervuo;Teuvo Kohonen
Venue:
DS '00 Proceedings of the Third International Conference on Discovery Science
Year:
2000

Abstract

New, more effective software tools are needed for the analysis and organization of the continually growing biological databases. An extension of the Self-Organizing Map (SOM) is used in this work for the clustering of all the 77,977 protein sequences of the SWISS-PROT database, release 37. In this method, unlike in some previous ones, the data sequences are not converted into histogram vectors in order to perform the clustering. Instead, a collection of true representative model sequences that approximate the contents of the database in a compact way is found automatically, based on the concept of the generalized median of symbol strings, after the user has defined any proper similarity measure for the sequences such as Smith-Waterman, BLAST, or FASTA. The FASTA method is used in this work. The benefits of the SOM and also those of its extension are fast computation, approximate representation of the large database by means of a much smaller, fixed number of model sequences, and an easy interpretation of the clustering by means of visualization. The complete sequence database is mapped onto a two-dimensional graphic SOM display, and clusters of similar sequences are then found and made visible by indicating the degree of similarity of the adjacent model sequences by shades of gray.
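The approach described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: plain Levenshtein distance stands in for a FASTA-derived dissimilarity, the model update uses the set median (the best member of the input set) rather than the generalized median, which need not be a member of the set, and the neighborhood is a fixed square radius. The names `edit_distance`, `set_median`, `train_string_som`, and `neighbor_distances` are hypothetical.

```python
import random


def edit_distance(a, b):
    """Levenshtein distance; an assumed stand-in for a FASTA-based dissimilarity."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def set_median(strings, dist=edit_distance):
    """Member of `strings` with the smallest summed distance to all members."""
    return min(strings, key=lambda s: sum(dist(s, t) for t in strings))


def train_string_som(data, rows, cols, iterations=5, radius=1,
                     dist=edit_distance, seed=0):
    """Batch-style SOM over strings: each model becomes the set median of the
    sequences mapped to its grid neighborhood."""
    rng = random.Random(seed)
    units = [(r, c) for r in range(rows) for c in range(cols)]
    models = {u: rng.choice(data) for u in units}  # random initial models
    for _ in range(iterations):
        # Step 1: map every sequence to its best-matching unit.
        hits = {u: [] for u in units}
        for s in data:
            bmu = min(units, key=lambda u: dist(s, models[u]))
            hits[bmu].append(s)
        # Step 2: update each model from the sequences in its neighborhood.
        for (r, c) in units:
            pool = [s for (r2, c2) in units
                    if max(abs(r - r2), abs(c - c2)) <= radius
                    for s in hits[(r2, c2)]]
            if pool:
                models[(r, c)] = set_median(pool, dist)
    return models


def neighbor_distances(models, dist=edit_distance):
    """Distance between adjacent models; larger values could be drawn as
    darker shades of gray to mark cluster borders."""
    out = {}
    for (r, c), m in models.items():
        for nb in ((r + 1, c), (r, c + 1)):
            if nb in models:
                out[((r, c), nb)] = dist(m, models[nb])
    return out
```

Calling `train_string_som(sequences, rows, cols)` returns one model sequence per map unit, and `neighbor_distances` gives the values that a gray-level display would shade. At the scale of the full database the exhaustive distance sums in this sketch would be too slow. It is meant only to show the structure of the computation.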