Statistical estimate for the size of the protein structural vocabulary

Authors:
Xuezheng Fu;Bernard Chen;Yi Pan;Robert W. Harrison
Affiliations:
Department of Computer Science, Georgia State University, Atlanta, GA;Department of Computer Science, Georgia State University, Atlanta, GA;Department of Computer Science, Georgia State University, Atlanta, GA;Department of Computer Science, Georgia State University, Atlanta, GA and Department of Biology, Georgia State University, Atlanta, GA
Venue:
ISBRA'07 Proceedings of the 3rd international conference on Bioinformatics research and applications
Year:
2007

Abstract

The concept of structural clusters defining the vocabulary of protein structure is central to the modern theory of protein folding. Typically, clusters are found by a variation of the K-means or K-NN algorithm. In this paper we study approaches to estimating the number of clusters in data. The optimal number of clusters is taken to be the one that yields a reliable clustering. Stability with respect to bootstrap sampling was adopted as the cluster validation measure for identifying a reliable clustering. To test this algorithm, six random subsets were drawn from the unique chains in the PDB. In each case the algorithm converged to a unique set of reliable clusters. Since these clusters were derived from random draws from the total current set of chains, counting the number of coincidences between them and applying basic sampling theory provides a rigorous statistical estimate of the number of unique clusters in the dataset.
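
The abstract does not say which sampling-theory estimator is applied to the coincidence counts. As a minimal sketch, assuming a two-sample capture-recapture setting, the Lincoln-Petersen estimator and its bias-corrected Chapman variant show how coincidences between two independently found cluster sets can yield an estimate of the total number of distinct clusters. The function names and the counts below are hypothetical and are not taken from the paper.

```python
def lincoln_petersen(n1: int, n2: int, shared: int) -> float:
    """Estimate the total number of distinct clusters from two samples.

    n1, n2 : number of reliable clusters found in each random subset
    shared : number of clusters found in both subsets (coincidences)

    Assumption: the two subsets are independent random draws, so the
    fraction of sample 2 that was already seen in sample 1 (shared / n2)
    approximates the fraction of all clusters seen in sample 1 (n1 / N).
    """
    if shared <= 0:
        raise ValueError("no coincidences: the estimate is unbounded")
    return n1 * n2 / shared


def chapman(n1: int, n2: int, shared: int) -> float:
    """Bias-corrected variant; defined even when shared == 0."""
    return (n1 + 1) * (n2 + 1) / (shared + 1) - 1


# Hypothetical counts, for illustration only (not results from the paper).
n1, n2, shared = 120, 150, 60
print(lincoln_petersen(n1, n2, shared))  # 300.0
print(chapman(n1, n2, shared))
```

With six subsets, as in the study, one option would be to compute such an estimate for each pair of subsets and compare them. This is again an assumption about how the counts could be combined, not a description of the authors' procedure.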