Statistical estimate for the size of the protein structural vocabulary

  • Authors:
  • Xuezheng Fu; Bernard Chen; Yi Pan; Robert W. Harrison

  • Affiliations:
  • Department of Computer Science, Georgia State University, Atlanta, GA; Department of Computer Science, Georgia State University, Atlanta, GA; Department of Computer Science, Georgia State University, Atlanta, GA; Department of Computer Science, Georgia State University, Atlanta, GA and Department of Biology, Georgia State University, Atlanta, GA

  • Venue:
  • ISBRA'07 Proceedings of the 3rd international conference on Bioinformatics research and applications
  • Year:
  • 2007

Abstract

The concept of structural clusters defining the vocabulary of protein structure is one of the central concepts in the modern theory of protein folding. Typically, clusters are found by a variation of the K-means or K-NN algorithm. In this paper we study approaches to estimating the number of clusters in data. The optimal number of clusters is believed to result in a reliable clustering. Stability with respect to bootstrap sampling was adopted as the cluster validation measure for identifying a reliable clustering. To test this algorithm, six random subsets were drawn from the unique chains in the PDB. The algorithm converged in each case to a unique set of reliable clusters. Since these clusters were drawn randomly from the total current set of chains, counting the number of coincidences between them and applying basic sampling theory provides a rigorous statistical estimate of the number of unique clusters in the dataset.
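
The abstract does not say which sampling-theory estimator is used. As an illustration of the general idea only, the sketch below applies Chapman's bias-corrected form of the Lincoln-Petersen capture-recapture estimator to two sets of clusters. Everything in it is an assumption rather than the authors' method: the function names, the cluster identifiers, and the simplification that two clusters "coincide" when they carry the same identifier (in practice some structural similarity threshold would be needed).

```python
def count_coincidences(clusters_a, clusters_b):
    """Number of clusters that appear in both sets (matched by identifier)."""
    return len(set(clusters_a) & set(clusters_b))


def chapman_estimate(n1, n2, m):
    """Chapman's bias-corrected Lincoln-Petersen estimate of population size.

    n1, n2: number of distinct clusters found in each independent sample
    m:      number of clusters found in both samples (coincidences)
    """
    if m < 0 or m > min(n1, n2):
        raise ValueError("m must lie between 0 and min(n1, n2)")
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


# Hypothetical cluster identifiers from two independent random subsets.
run_a = {"c01", "c02", "c03", "c04", "c05", "c06"}
run_b = {"c04", "c05", "c06", "c07", "c08"}

m = count_coincidences(run_a, run_b)
estimate = chapman_estimate(len(run_a), len(run_b), m)
print(f"coincidences = {m}, estimated vocabulary size = {estimate:.1f}")
```

The fewer coincidences two independent samples share, the larger the estimated total. With six subsets, as in the abstract, one could average such pairwise estimates, though that too is an assumption about the procedure.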