Intelligent Data Analysis for Protein Disorder Prediction

  • Authors:
  • Pedro Romero; Zoran Obradovic; A. Keith Dunker

  • Affiliations:
  • School of Electrical Engineering and Computer Science (E-mail: promero@eecs.wsu.edu); School of Electrical Engineering and Computer Science (E-mail: zoran@eecs.wsu.edu); Department of Biochemistry and Biophysics, Washington State University, Pullman, WA 99164, USA (E-mail: dunker@mail.wsu.edu)

  • Venue:
  • Artificial Intelligence Review - Issues on the application of data mining
  • Year:
  • 2000

Abstract

Although an ordered 3D structure is generally considered to be a necessary pre-condition for protein functionality, there are disordered counter examples found to have biological activity. The objectives of our data mining project are: (1) to generalize from the limited set of counter examples and then apply this knowledge to large databases of amino acid sequences in order to estimate the commonness of disordered protein regions in nature, and (2) to determine whether there are different types of protein disorder. For general disorder estimation, a neural network based predictor was designed and tested on data built from several public domain data banks through a nontrivial search, statistical analysis and data dimensionality reduction. In addition, predictors for identification of family-specific disorder were developed by extracting knowledge from databases generated through multiple sequence alignments of a known disordered sequence to other highly related proteins. Family-specific predictors were also integrated to test the quality of general protein disorder identification from such hybrid prediction systems. Out-of-sample cross validation performance of several predictors was computed first, followed by tests on an unrelated database of proteins with long disordered regions, and the application of a few selected predictors to two large protein data banks: Nrl_3D, currently containing more than 10,000 protein fragments of known 3D structure, and Swiss Protein, having almost 60,000 protein sequences. The obtained results provide evidence that long disordered regions are common in nature, with an estimate that 11% of all the residues in the Swiss Protein data bank belong to disordered regions of length 40 or greater. The hypothesis that different protein disorder types exist is supported by high specificity/low sensitivity results of two family-specific predictors, by hybrid systems outperforming general models on a two-family test, and by the existence of significant gaps in Swiss Protein vs. Nrl_3D disorder frequency estimates for both families. These findings prompt the need for a revision in the current understanding of protein structure and function, as well as for the development of improved disorder predictors that should have important uses in biotechnology applications.
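
The headline estimate above (the share of residues that fall in disordered regions of length 40 or greater) can be illustrated with a minimal sketch. It assumes that a predictor has already produced one label per residue, encoded here as a string of "D" (disordered) and "O" (ordered) for each protein. The function name, the label encoding, and the toy data are hypothetical and are not taken from the paper; the neural network predictor itself is not reproduced.

```python
from itertools import groupby


def long_disorder_fraction(label_strings, min_length=40):
    """Fraction of all residues lying in disordered runs of at least min_length.

    label_strings: iterable of per-protein strings, one character per residue,
    with "D" for predicted disorder and "O" for predicted order (an assumed,
    illustrative encoding).
    """
    total_residues = 0
    residues_in_long_runs = 0
    for labels in label_strings:
        total_residues += len(labels)
        # groupby yields maximal runs of identical consecutive labels.
        for label, run in groupby(labels):
            run_length = sum(1 for _ in run)
            if label == "D" and run_length >= min_length:
                residues_in_long_runs += run_length
    if total_residues == 0:
        return 0.0
    return residues_in_long_runs / total_residues


# Toy data: the first protein has a 50-residue disordered run (counted),
# the second has only a 30-residue run (too short to count).
toy_predictions = ["O" * 50 + "D" * 50, "O" * 70 + "D" * 30]
print(long_disorder_fraction(toy_predictions))
```

Only runs that meet the length threshold contribute to the numerator, so short predicted disordered stretches do not inflate the estimate.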