Granular support vector machines with association rules mining for protein homology prediction

  • Authors:
  • Yuchun Tang;Bo Jin;Yan-Qing Zhang

  • Affiliations:
  • Department of Computer Science, Georgia State University, P.O. Box 3994, Atlanta, GA 30302, USA;Department of Computer Science, Georgia State University, P.O. Box 3994, Atlanta, GA 30302, USA;Department of Computer Science, Georgia State University, P.O. Box 3994, Atlanta, GA 30302, USA

  • Venue:
  • Artificial Intelligence in Medicine
  • Year:
  • 2005

Abstract

Objective: Predicting homology between protein sequences is one of the critical problems in computational biology. Complex classification problems of this kind are common in medical and biological information processing. Building a model that generalizes well from training samples is essential for mining knowledge that accurately classifies unseen samples and effectively supports human experts in making correct decisions.

Methodology: A new learning model called granular support vector machines (GSVM) is proposed, based on our previous work. GSVM systematically and formally combines principles from statistical learning theory and granular computing theory, and thus provides an interesting new mechanism for addressing complex classification problems. It works by building a sequence of information granules and then building support vector machines (SVMs) in some of these granules on demand. A good granulation method that finds suitable granules is crucial for modeling a GSVM with good performance. In this paper, we also propose an association rules-based granulation method. Granules induced by association rules with sufficiently high confidence and significant support are left as they are, because of their high "purity" and their significant effect on simplifying the classification task. For every other granule, an SVM is modeled to discriminate the corresponding data. In this way, a complex classification problem is divided into multiple smaller problems, so that the learning task is simplified.

Results and conclusions: The proposed algorithm, here named GSVM-AR, is compared with SVM on the KDD Cup 2004 (KDDCUP04) protein homology prediction data. The experimental results show that finding the splitting hyperplane is not a trivial task (the association rules must be selected carefully to avoid overfitting) and that GSVM-AR achieves a significant improvement over building one single SVM in the whole feature space. Another advantage is that GSVM-AR is practical, because it is easy to implement. More importantly and more interestingly, GSVM provides a new mechanism for addressing complex classification problems.
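The granulation idea described in the methodology can be illustrated with a minimal sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the function names (`mine_rule`, `train_linear_svm`, `fit_gsvm_ar`, `predict`), the restriction to a single rule of the form "feature > threshold implies class", the thresholds `min_conf` and `min_support`, the Pegasos-style linear SVM solver, and the toy data. Samples covered by a high-confidence rule form a "pure" granule and take the rule's class; one SVM is trained on the remaining granule.

```python
import random

def mine_rule(X, y, min_conf=0.95, min_support=0.1):
    """Return (feature, threshold, label) for the widest rule
    'x[feature] > threshold => label' meeting both thresholds, else None."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            covered = [lab for row, lab in zip(X, y) if row[f] > t]
            if not covered or len(covered) < min_support * len(X):
                continue  # rule does not have enough support
            for label in (-1, 1):
                conf = covered.count(label) / len(covered)
                if conf >= min_conf and (best is None or len(covered) > best[0]):
                    best = (len(covered), f, t, label)
    return None if best is None else best[1:]

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Linear SVM trained by Pegasos-style stochastic subgradient descent
    on the L2-regularized hinge loss. The last weight acts as the bias."""
    rng = random.Random(seed)
    w = [0.0] * (len(X[0]) + 1)
    step = 0
    for _ in range(epochs):
        for i in rng.sample(range(len(X)), len(X)):
            step += 1
            eta = 1.0 / (lam * step)
            xi = list(X[i]) + [1.0]
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, xi))
            w = [(1.0 - eta * lam) * wj for wj in w]  # regularization shrink
            if margin < 1:  # hinge loss is active for this sample
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, xi)]
    return w

def fit_gsvm_ar(X, y, min_conf=0.95, min_support=0.1):
    """Mine one rule, keep its granule as is, train an SVM on the rest."""
    rule = mine_rule(X, y, min_conf, min_support)
    if rule is None:
        return None, train_linear_svm(X, y)
    f, t, _ = rule
    rest = [(row, lab) for row, lab in zip(X, y) if not row[f] > t]
    rest = rest or list(zip(X, y))  # fall back to all data if nothing remains
    Xr, yr = zip(*rest)
    return rule, train_linear_svm(Xr, yr)

def predict(model, row):
    rule, w = model
    if rule is not None:
        f, t, label = rule
        if row[f] > t:
            return label  # sample falls in the rule-induced granule
    score = sum(wj * xj for wj, xj in zip(w, list(row) + [1.0]))
    return 1 if score >= 0 else -1

# Toy data (hypothetical): samples with x0 > 1 are all negative, while the
# remaining samples are separated by the sign of x1.
X = [[10, -3], [10, -2], [10, 2], [10, 3], [9, 2], [9, -2],
     [0, 2], [1, 3], [0, -2], [1, -3]]
y = [-1, -1, -1, -1, -1, -1, 1, 1, -1, -1]
model = fit_gsvm_ar(X, y)
predictions = [predict(model, row) for row in X]
```

On this toy data a single SVM over the whole feature space would have to separate the negative samples with large `x0` from the positives as well, whereas here the rule removes them first and the SVM only has to split the remaining samples. A fuller version would mine several rules over discretized features and could train one SVM per remaining granule.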