Detection of Outlier Residues for Improving Interface Prediction in Protein Heterocomplexes

Authors:
Peng Chen; Limsoon Wong; Jinyan Li
Affiliations:
Chinese Academy of Sciences, Hefei; National University of Singapore, Singapore; National University of Singapore, Singapore and the University of Technology Sydney
Venue:
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Year:
2012

Abstract

Sequence-based understanding and identification of protein binding interfaces is a challenging research topic because of the complexity of protein systems and the imbalanced distribution of interface and noninterface residues. This paper presents an outlier detection approach to address the redundancy problem in protein interaction data; the cleaned training data are then used to improve prediction performance. We use three novel measures to describe the extent to which a residue is an outlier relative to the other residues: the distance of a residue instance from the center instance of all residue instances with the same class label (Dist), the probability of the class label of the residue instance (PCL), and the importance of within-class and between-class (IWB) residue instances. Outlier scores are computed by integrating the three measures; instances with a sufficiently large score are treated as outliers and removed. The data sets without outliers are taken as input for a support vector machine (SVM) ensemble. The SVM ensemble trained on data without outliers performs better than the one trained on data with outliers. Our method is also more accurate than many previously published methods on benchmark data sets. From our empirical studies, we found that some outlier interface residues are in fact close to noninterface regions, and some outlier noninterface residues are close to interface regions.
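
As a rough illustration of the Dist measure described in the abstract, the following Python sketch computes the distance of each residue instance from the center of its own class and drops instances whose distance exceeds a cutoff. It is not the authors' implementation. The abstract does not give formulas, so the Euclidean metric, the use of the per-class mean as the "center instance", the fixed `threshold`, and all function names are assumptions made for this example. The PCL and IWB measures and the way the three factors are integrated are left out because the abstract does not define them.

```python
import math
from collections import defaultdict


def class_centers(features, labels):
    """Return the mean feature vector of each class (assumed 'center instance')."""
    sums, counts = {}, defaultdict(int)
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def dist_scores(features, labels):
    """Euclidean distance of each instance from the center of its own class."""
    centers = class_centers(features, labels)
    return [math.dist(x, centers[y]) for x, y in zip(features, labels)]


def remove_outliers(features, labels, threshold):
    """Keep only instances whose Dist score does not exceed the threshold.

    The threshold is a hypothetical, user-chosen cutoff; the paper combines
    Dist with two further measures before deciding what to remove.
    """
    scores = dist_scores(features, labels)
    kept = [(x, y) for x, y, s in zip(features, labels, scores) if s <= threshold]
    return [x for x, _ in kept], [y for _, y in kept]
```

Calling `remove_outliers(features, labels, threshold)` on a list of feature vectors and class labels (for example, 1 for interface and 0 for noninterface) returns the filtered features and labels. These could then be passed to any classifier, such as the SVM ensemble mentioned in the abstract.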