Simple sequence-based kernels do not predict protein–protein interactions

Authors:
Jiantao Yu;Maozu Guo;Chris J. Needham;Yangchao Huang;Lu Cai;David R. Westhead
Venue:
Bioinformatics
Year:
2010

Abstract

Motivation: A number of methods have been reported that predict protein–protein interactions (PPIs) with high accuracy using only simple sequence-based features such as amino acid 3mer content. This is surprising, given that many protein interactions have high specificity that depends on detailed atomic recognition between physicochemically complementary surfaces. Are the reported high accuracies realistic?

Results: We find that the reported accuracies of the predictions are significantly over-estimated, and strongly dependent on the structure of the training and testing datasets used. The choice of which protein pairs are deemed as non-interactions in the training data has a variable impact on the accuracy estimates, and the accuracies can be artificially inflated by a bias towards dominant samples in the positive data which result from the presence of hub proteins in the protein interaction network. To address this bias, we propose a positive set-specific method to create a ‘balanced’ negative set maintaining the degree distribution for each protein, leading to the conclusion that simple sequence-based features contain insufficient information to be useful for predicting PPIs, but that protein domain-based features have some predictive value.

Availability: Our method, named ‘BRS-nonint’, is available at http://www.bioinformatics.leeds.ac.uk/BRS-nonint/. All the datasets used in this study are derived from publicly available data, and are available at http://www.bioinformatics.leeds.ac.uk/BRS-nonint/PPI_RandomBalance.html

Contact: maozuguo@hit.edu.cn; d.r.westhead@leeds.ac.uk
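
The idea of a 'balanced' negative set, in which each protein appears in as many non-interacting pairs as it does in interacting pairs, can be illustrated with a short sketch. This is not the BRS-nonint implementation, whose details the abstract does not give; it is a minimal rejection-sampling approach under stated assumptions. The function name `balanced_negatives` and its parameters are hypothetical, positives are assumed to contain no self-interactions or duplicate pairs, and every pair absent from the positive set is assumed to be a valid non-interaction candidate.

```python
import random
from collections import Counter


def balanced_negatives(positives, seed=0, max_tries=1000):
    """Sample negative pairs so each protein keeps its positive-set degree.

    Hypothetical helper for illustration only. Each attempt shuffles a list
    holding every protein once per positive pair it occurs in, then pairs up
    neighbours. The attempt is rejected if it yields a self-pair, a known
    positive pair, or a duplicate negative pair.
    """
    rng = random.Random(seed)
    known = {frozenset(pair) for pair in positives}
    degree = Counter(protein for pair in positives for protein in pair)

    for _ in range(max_tries):
        # One "stub" per occurrence of a protein in the positive set.
        stubs = [p for p, d in degree.items() for _ in range(d)]
        rng.shuffle(stubs)

        negatives = set()
        valid = True
        for a, b in zip(stubs[0::2], stubs[1::2]):
            pair = frozenset((a, b))
            if a == b or pair in known or pair in negatives:
                valid = False
                break
            negatives.add(pair)

        if valid:
            return sorted(tuple(sorted(pair)) for pair in negatives)

    # Highly connected hubs can make a balanced set rare or impossible.
    raise RuntimeError("no balanced negative set found within max_tries")


# Toy network: a ring of six proteins, each with degree 2.
positives = [("A", "B"), ("B", "C"), ("C", "D"),
             ("D", "E"), ("E", "F"), ("F", "A")]
negatives = balanced_negatives(positives)
```

Because hub proteins appear equally often in both classes under this scheme, a classifier cannot score well merely by recognising frequently occurring proteins, which is the bias the Results paragraph describes. Whole-attempt rejection is only practical for small networks; a larger network would need a more efficient strategy.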