Role and Results of statistical methods in protein fold class prediction

Authors:
L. Edler; J. Grassmann; S. Suhai
Affiliations:
Biostatistics Unit - R0700, German Cancer Research Center, P.O. Box 10 19 49, D-69120 Heidelberg, Germany; Biostatistics Unit - R0700, German Cancer Research Center, P.O. Box 10 19 49, D-69120 Heidelberg, Germany; Department of Molecular Biophysics, German Cancer Research Center, P.O. Box 10 19 49, D-69120 Heidelberg, Germany
Venue:
Mathematical and Computer Modelling: An International Journal
Year:
2001

Abstract

Statistical methods of discrimination and classification are used to predict protein structure from amino acid sequence data. Such predictions provide information for establishing new paradigms of carcinogenesis modeling on the basis of gene expression. Feedforward neural networks and standard statistical classification procedures are used to classify proteins into fold classes. Three groups of methods are applied to a data set of 268 sequences: logistic regression, additive models, and projection pursuit regression, from the family of methods based on a posteriori probabilities; linear, quadratic, and flexible discriminant analysis, from the class of methods based on class-conditional probabilities; and the nearest-neighbor classification rule. From the prediction error obtained with a test sample (n = 125) and with a cross-validation procedure, we conclude that standard linear discriminant analysis and nearest-neighbor methods are both statistically feasible and potent competitors to the more flexible feedforward neural networks. Further research is needed to explore the gain obtainable from statistical methods when applied to larger sets of protein sequence data and to compare the results with those from biophysical approaches.
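As a rough illustration of how a nearest-neighbor rule and a cross-validated prediction error fit together, the following minimal sketch may help. It is not the authors' implementation. Several things here are assumptions that the abstract does not state: the encoding of each sequence as its 20-dimensional amino acid composition, the choice of k = 3, the leave-one-out form of cross-validation, and the helper names (`composition`, `knn_predict`, `loo_error`, `toy_sequences`). The sequences are synthetic and the two class labels are hypothetical; they are not real fold classes.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Relative frequency of each of the 20 amino acids in seq.

    Assumption: composition is used only as a simple stand-in encoding.
    """
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def knn_predict(train_x, train_y, x, k=3):
    """Majority vote among the k training points closest to x."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def loo_error(xs, ys, k=3):
    """Leave-one-out cross-validation estimate of the error rate."""
    wrong = 0
    for i in range(len(xs)):
        # Hold out observation i and classify it from all the others.
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        if knn_predict(rest_x, rest_y, xs[i], k) != ys[i]:
            wrong += 1
    return wrong / len(xs)


def toy_sequences(n_per_class=30, length=80, seed=0):
    """Synthetic sequences with class-specific residue biases (made up)."""
    rng = random.Random(seed)
    biases = {"class_A": "AELKM", "class_B": "VIYTW"}  # hypothetical classes
    seqs, labels = [], []
    for label, favored in biases.items():
        weights = [5 if a in favored else 1 for a in AMINO_ACIDS]
        for _ in range(n_per_class):
            seqs.append("".join(rng.choices(AMINO_ACIDS, weights=weights, k=length)))
            labels.append(label)
    return seqs, labels


seqs, labels = toy_sequences()
features = [composition(s) for s in seqs]
cv_error = loo_error(features, labels, k=3)
print(f"leave-one-out error: {cv_error:.3f}")
```

The same `loo_error` loop could wrap any other classifier in place of `knn_predict`, which is the sense in which a cross-validated error puts competing methods on a common footing. An independent test sample would be handled analogously, by fitting on one subset and counting misclassifications on the held-out remainder.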