Machine learning and bioinformatics

  • Authors:
  • Haesun Park; Hyunsoo Kim

  • Venue:
  • Machine learning and bioinformatics
  • Year:
  • 2004

Abstract

For many applications in which expensive updating of transactions is frequently required, it is desirable to develop incremental and decremental machine learning algorithms that can efficiently compute the updated decision function when data points are appended or deleted. An incremental and decremental kernel discriminant analysis (KDA) and an incremental and decremental least squares support vector machine (SVM) have been introduced. The proposed incremental and decremental KDA avoids updating the computationally expensive eigenvalue decomposition (EVD), which would otherwise be required for an incremental and decremental classifier built on EVD-based KDA.

Gene expression data often contain missing expression values. Effective missing value estimation methods are needed because many algorithms for gene expression data analysis require a complete matrix of gene expression data. The local least squares imputation method (LLSimpute) has been proposed, which exploits local similarity structures in the data together with a least squares optimization process. The proposed LLSimpute method shows the best performance among the imputation methods tested on various data sets and percentages of missing values.

The identification of discriminant genes is also one of the most fundamental steps in microarray gene expression data analysis, since it suggests representative genes to be explored in medical research. A multiclass gene selection method based on generalized linear discriminant analysis has been designed for the classification of cancer subtypes. Four genes are proposed for subtype classification of leukemia, which yield only one misclassification during leave-one-out cross validation (LOOCV), and nine genes are suggested for subtype classification of small round blue cell tumor, which yield complete classification in the LOOCV test. Furthermore, a three-stage framework for gene expression data analysis regarding continuous phenotypes, based on L1-norm support vector regression, has been introduced.

Finally, a protein secondary structure prediction method, SVMpsi, has been developed to improve the current level of prediction accuracy of the SVM approach by incorporating PSI-BLAST PSSM profiles. SVM has also been applied successfully, for the first time, to protein solvent accessibility prediction with a three-dimensional local descriptor, as an intermediate step toward the prediction of protein tertiary structure.
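
The local least squares idea behind LLSimpute, as summarized in the abstract, can be illustrated with a minimal sketch. It is not the authors' implementation. The function name `lls_impute_row`, the neighbour count `k`, the use of Euclidean distance, and the restriction of candidate neighbours to rows with no missing values are all assumptions made for illustration. The sketch picks the `k` most similar genes (rows) using the columns where the target gene is observed, fits least squares weights on those columns, and applies the same weights to the columns where the target is missing.

```python
import numpy as np

def lls_impute_row(data, row, k):
    """Estimate the NaN entries of data[row] from its k nearest complete rows.

    Hypothetical helper for illustration only; names and choices are assumptions.
    """
    target = data[row]
    missing = np.isnan(target)
    observed = ~missing

    # Candidate neighbours: rows without any missing value (a simplification).
    complete = np.where(~np.isnan(data).any(axis=1))[0]
    candidates = data[complete]

    # Local similarity: Euclidean distance over the target's observed columns.
    dist = np.linalg.norm(candidates[:, observed] - target[observed], axis=1)
    nearest = candidates[np.argsort(dist)[:k]]

    # Least squares: find weights x with nearest[:, observed].T @ x ~ target[observed].
    x, *_ = np.linalg.lstsq(nearest[:, observed].T, target[observed], rcond=None)

    # Apply the same weights to the neighbours' values in the missing columns.
    filled = target.copy()
    filled[missing] = nearest[:, missing].T @ x
    return filled
```

For example, calling `lls_impute_row(data, row=2, k=2)` on a matrix whose third row is the sum of the first two, with one entry set to `np.nan`, recovers that entry from the two neighbouring rows. The choice of `k` is left open here; in practice it would need to be tuned for the data set.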