Graph-based clustering for finding distant relationships in a large set of protein sequences

Authors:
Hideya Kawaji; Yoichi Takenaka; Hideo Matsuda
Affiliations:
Department of Bioinformatic Engineering, Graduate School of Information Science and Technology, Osaka University, 1-3 Machikaneyama, Toyonaka, Osaka 560-8531, Japan (all authors)
Venue:
Bioinformatics
Year:
2004

Abstract

Motivation: Clustering of protein sequences is widely used for the functional characterization of proteins. However, it remains difficult to cluster distantly related proteins, whose sequences share only regional similarity. An algorithm designed for clustering such distantly related proteins is therefore needed.

Results: We have developed a time- and space-efficient clustering algorithm. It uses a graph representation in which vertices denote proteins and edges denote sequence similarities above a certain cutoff score. It repeatedly partitions the graph by removing edges that have small weights, which correspond to low sequence similarities. To find appropriate partitions, we introduce a score that combines the normalized cut with locally minimal cut capacities. We applied our method to all 40 703 human proteins in SWISS-PROT and TrEMBL. The resulting clusters show a 76% recall (20 529 proteins) of the 26 917 proteins classified by InterPro. The method also finds relationships not found by other clustering methods.

Availability: The complete results of our algorithm for all the human proteins in SWISS-PROT and TrEMBL, together with other supplementary information, are available at http://motif.ics.es.osaka-u.ac.jp/Ncut-KL/
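
The graph representation and the normalized cut mentioned in the Results can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not define the combined score, so only the standard normalized cut of a bipartition, Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V), is computed here. The function names `build_graph` and `normalized_cut`, the protein labels, the similarity scores and the cutoff value are all hypothetical and chosen for illustration only.

```python
def build_graph(scores, cutoff):
    """Build an undirected weighted graph from pairwise similarity scores.

    Only pairs scoring above the cutoff become edges; proteins with no
    edge above the cutoff do not appear in the graph at all.
    """
    graph = {}
    for (a, b), weight in scores.items():
        if weight > cutoff:
            graph.setdefault(a, {})[b] = weight
            graph.setdefault(b, {})[a] = weight
    return graph


def normalized_cut(graph, part_a):
    """Standard normalized cut of the bipartition (part_a, remaining vertices)."""
    part_a = set(part_a) & set(graph)
    part_b = set(graph) - part_a
    # Total weight of edges crossing the partition.
    cut = sum(w for u in part_a for v, w in graph[u].items() if v in part_b)
    # Total weight of edges incident to each side (weighted degree sums).
    assoc_a = sum(w for u in part_a for w in graph[u].values())
    assoc_b = sum(w for u in part_b for w in graph[u].values())
    if assoc_a == 0 or assoc_b == 0:
        return float("inf")  # degenerate partition: one side has no edges
    return cut / assoc_a + cut / assoc_b


# Hypothetical similarity scores for five proteins (illustrative values only).
scores = {
    ("P1", "P2"): 90.0, ("P2", "P3"): 80.0, ("P1", "P3"): 85.0,
    ("P4", "P5"): 95.0, ("P3", "P4"): 12.0, ("P1", "P5"): 5.0,
}
graph = build_graph(scores, cutoff=10.0)

# Splitting at the weak P3-P4 edge gives a small normalized cut,
# whereas splitting inside the tightly connected group gives a larger one.
ncut_weak = normalized_cut(graph, {"P1", "P2", "P3"})
ncut_strong = normalized_cut(graph, {"P1"})
print(ncut_weak < ncut_strong)
```

In this toy graph the partition that removes only the low-weight edge between P3 and P4 scores lower than one that separates P1 from its strongly similar neighbours, which is the behaviour a cut-based criterion relies on when choosing where to split.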