An efficient technique for superfamily classification of amino acid sequences: feature extraction, fuzzy clustering and prototype selection

  • Authors: Sanghamitra Bandyopadhyay
  • Affiliations: Machine Intelligence Unit, Indian Statistical Institute, 203 B.T. Road, Kolkata 700 108, India
  • Venue: Fuzzy Sets and Systems
  • Year: 2005

Abstract

In this article, we propose an efficient technique for classifying amino acid sequences into superfamilies. The method first extracts 20 features from a set of training sequences; these features take into account the probabilities of occurrence of the amino acids at the different positions of the sequences. A genetic fuzzy clustering approach is then used to automatically evolve a set of prototypes representing each class. This clustering method does not require a priori information about the number of clusters and is able to escape locally optimal configurations. Finally, the nearest neighbor rule classifies an unknown sequence into a superfamily based on its proximity to the prototypes evolved by the genetic fuzzy clustering technique. Because an unknown sequence is compared only with these prototypes rather than with every training sequence, the time required for classification is reduced significantly. Results for three superfamilies, namely globin, trypsin and ras, demonstrate the effectiveness of the proposed technique relative to the case where all the training sequences are used for classification with the same set of features. Comparison with the well-known technique BLAST also shows that the proposed method significantly reduces classification time while providing comparable classification performance.
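
The pipeline described in the abstract (20 features per sequence, a small set of prototypes per class, then the nearest neighbor rule against those prototypes) can be illustrated with a minimal Python sketch. It is not the paper's method: the abstract does not specify the 20 features, so plain amino acid composition (relative frequency of each of the 20 standard residues) is assumed as a stand-in, and a single mean vector per class replaces the prototypes that the paper evolves by genetic fuzzy clustering. The function names and the toy sequences are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition_features(seq):
    """Return 20 relative frequencies, one per standard amino acid.

    Assumption: a simple stand-in for the paper's 20 features, which the
    abstract does not define in detail.
    """
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def mean_prototype(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_prototypes(training):
    """Map {label: [sequences]} to a list of (label, prototype) pairs.

    Assumption: one mean prototype per class, standing in for the
    prototypes evolved by genetic fuzzy clustering in the paper.
    """
    return [
        (label, mean_prototype([composition_features(s) for s in seqs]))
        for label, seqs in training.items()
    ]


def classify(seq, prototypes):
    """Nearest neighbor rule: return the label of the closest prototype."""
    x = composition_features(seq)
    label, _ = min(prototypes, key=lambda p: math.dist(x, p[1]))
    return label


# Toy strings made up for illustration; they are not real protein sequences.
toy_training = {
    "class_a": ["AAAAGG", "AAAGGA"],
    "class_b": ["WWWYYW", "YYWWWY"],
}
prototypes = build_prototypes(toy_training)
print(classify("AAGGAA", prototypes))  # prints: class_a
```

In this sketch the cost of classifying one sequence grows with the number of prototypes rather than with the number of training sequences, which is the source of the time saving the abstract reports.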