Pattern Analysis and Prediction of O-Linked Glycosylation Sites in Protein by Principal Component Subspace Analysis

  • Authors:
  • Yen-Wei Chen;Xuemei Yang;Masahiro Ito;Ikuko Nishikawa

  • Affiliations:
  • Elect & Information Eng. School, Central South Univ. of Forestry and Technology, Changsha 410004, China and College of Information Science and Eng., Ritsumeikan Univ., Shiga, 525-8577, Japan;College of Information Science and Eng., Ritsumeikan Univ., Shiga, 525-8577, Japan and Department of Mathematics, Xianyang Normal Univ., Xianyang 712000, China;College of Information Science and Eng., Ritsumeikan Univ., Shiga, 525-8577, Japan;College of Information Science and Eng., Ritsumeikan Univ., Shiga, 525-8577, Japan

  • Venue:
  • KES '07: Proceedings of the 11th International Conference on Knowledge-Based Intelligent Information and Engineering Systems and the XVII Italian Workshop on Neural Networks
  • Year:
  • 2007

Abstract

Glycosylation is one of the most important post-translational modification steps in the synthesis of membrane and secreted proteins, and more than half of all proteins are glycosylated. In this paper, we propose a principal component analysis (PCA) based subspace approach for pattern analysis and prediction of O-glycosylation sites in proteins. PCA is used to find the principal components and subspaces of glycosylated residues and nonglycosylated residues, respectively. From the calculated principal components, we found that the glycosylated proteins all have a high serine, threonine and proline content. The prediction can be viewed as a 4-class classification problem or as 2-class classification problems. We project the protein sequence (test vector) into each subspace and calculate the distance between the test vector and its projection on the subspace. The protein sequence is then classified into the "nearest" class, that is, the class whose subspace gives the smallest distance. The prediction accuracy for nonglycosylated sites (negative sites) is about 70%-90%, and the accuracy for O-glycosylated sites (positive sites) is about 70%-100%.
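
The nearest-subspace rule described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how sequences are encoded as vectors, how many principal components are kept, or whether each class is mean-centered, so the function names (`fit_subspace`, `residual_distance`, `classify`), the per-class centering, the number of components, the class labels and the toy 3-dimensional data are all assumptions made for the example.

```python
import numpy as np

def fit_subspace(X, n_components):
    """Fit one class subspace: return (mean, basis).

    The basis columns are the leading principal directions of X.
    Centering on the class mean is an assumption of this sketch.
    """
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T

def residual_distance(x, mean, basis):
    """Distance between x and its orthogonal projection onto the subspace."""
    centered = x - mean
    projection = basis @ (basis.T @ centered)
    return float(np.linalg.norm(centered - projection))

def classify(x, subspaces):
    """Assign x to the class whose subspace is nearest."""
    return min(subspaces, key=lambda label: residual_distance(x, *subspaces[label]))

# Toy stand-in data (hypothetical): each class varies along a different axis.
rng = np.random.default_rng(0)
t = rng.normal(size=(50, 1))
pos = t * np.array([1.0, 0.0, 0.0]) + 0.01 * rng.normal(size=(50, 3))
neg = t * np.array([0.0, 1.0, 0.0]) + 0.01 * rng.normal(size=(50, 3))

subspaces = {
    "positive": fit_subspace(pos, n_components=1),
    "negative": fit_subspace(neg, n_components=1),
}
label = classify(np.array([2.0, 0.0, 0.0]), subspaces)
```

With real data, each row of `X` would be a numeric encoding of the residues around a candidate site, and one subspace would be fitted per class. Under that assumption, the 4-class setting mentioned in the abstract would simply use four entries in `subspaces` instead of two.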