A hybrid framework for protein sequence clustering and classification using signature motif information

  • Authors:
  • Wei-Bang Chen; Chengcui Zhang

  • Affiliations:
  • Department of Computer and Information Sciences, University of Alabama at Birmingham, Birmingham, AL, USA (both authors). Corresponding author: Chengcui Zhang, Tel.: +1 205 934 8606, Fax: +1 205 934 5473, Email: zhang@cis.uab.edu

  • Venue:
  • Integrated Computer-Aided Engineering
  • Year:
  • 2009

Abstract

In this paper, we propose an unsupervised hybrid framework for protein sequence clustering and classification that incorporates protein structural motif information. The framework consists of three stages: protein structural motif scanning, hybrid clustering, and sequence classification. Incorporating the structural motifs detected by the ScanProsite service yields a better measure of sequence similarity. The proposed two-phase hybrid clustering approach combines the strengths of hierarchical and partitional clustering. Phase I uses hierarchical agglomerative clustering to pre-cluster the multiply aligned sequences. Phase II performs partitional clustering, initializing its partition from the Phase I result and representing each cluster with a profile Hidden Markov Model (HMM). The profile HMMs are then stored in a database and used to classify unknown sequences by finding the best alignment of each sequence to each stored profile HMM. Our experiments demonstrate the effectiveness and efficiency of the proposed framework for biological sequence clustering and classification.
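
The two-phase idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and makes several simplifying assumptions: the toy sequences are already aligned and of equal length, a normalized Hamming distance stands in for the motif-aware similarity measure, and a position-specific frequency profile with pseudocounts stands in for a profile HMM. All function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log


def hamming_distance(a, b):
    """Fraction of mismatched positions (assumes equal-length, pre-aligned sequences)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def agglomerative(seqs, k):
    """Phase I (sketch): average-linkage agglomerative clustering down to k clusters."""
    clusters = [[i] for i in range(len(seqs))]

    def linkage(c1, c2):
        total = sum(hamming_distance(seqs[i], seqs[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # Find the closest pair of clusters and merge them.
        _, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


def build_profile(members, seqs, alphabet, pseudo=1.0):
    """Per-column residue probabilities with pseudocounts (simplified stand-in for a profile HMM)."""
    total = len(members) + pseudo * len(alphabet)
    profile = []
    for pos in range(len(seqs[0])):
        counts = Counter(seqs[i][pos] for i in members)
        profile.append({ch: (counts[ch] + pseudo) / total for ch in alphabet})
    return profile


def log_score(seq, profile):
    """Log-probability of a sequence under a column-independent profile."""
    return sum(log(col[ch]) for ch, col in zip(seq, profile))


def classify(seq, profiles):
    """Index of the best-scoring profile for a sequence."""
    return max(range(len(profiles)), key=lambda c: log_score(seq, profiles[c]))


def refine(seqs, clusters, alphabet, max_iter=10):
    """Phase II (sketch): partition refinement seeded with the Phase I clusters."""
    for _ in range(max_iter):
        profiles = [build_profile(c, seqs, alphabet) for c in clusters]
        new = [[] for _ in clusters]
        for i, s in enumerate(seqs):
            new[classify(s, profiles)].append(i)
        new = [c for c in new if c]  # drop clusters that lost all members
        if new == clusters:
            break
        clusters = new
    return clusters, [build_profile(c, seqs, alphabet) for c in clusters]


# Toy, made-up aligned sequences (not real protein data).
seqs = ["ACDEFG", "ACDEFA", "ACDEYG", "WWKKLL", "WWKKLM", "WYKKLL"]
alphabet = sorted(set("".join(seqs)))

initial = agglomerative(seqs, k=2)
clusters, profiles = refine(seqs, initial, alphabet)
label_a = classify("ACDEFF", profiles)  # unseen sequence resembling the first group
label_b = classify("WWKKMM", profiles)  # unseen sequence resembling the second group
```

Seeding the partition step with the hierarchical result avoids a random initialization, which is the motivation the abstract gives for combining the two approaches. The stored profiles are then reused to label new sequences, mirroring the classification stage.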