Research article: Using ensemble methods to deal with imbalanced data in predicting protein-protein interactions

Authors:
Yongqing Zhang;Danling Zhang;Gang Mi;Daichuan Ma;Gongbing Li;Yanzhi Guo;Menglong Li;Min Zhu
Affiliations:
College of Computer Science, Sichuan University, Chengdu 610065, PR China;College of Computer Science, Sichuan University, Chengdu 610065, PR China;School of Life Science, Sichuan University, Chengdu 610064, PR China;College of Chemistry, Sichuan University, Chengdu 610064, PR China;College of Computer Science, Sichuan University, Chengdu 610065, PR China;College of Chemistry, Sichuan University, Chengdu 610064, PR China;College of Chemistry, Sichuan University, Chengdu 610064, PR China;College of Computer Science, Sichuan University, Chengdu 610065, PR China
Venue:
Computational Biology and Chemistry
Year:
2012

Abstract

In protein interaction data, the number of interacting pairs is usually much smaller than the number of non-interacting ones, so the prediction of protein-protein interactions (PPIs) is an imbalanced data problem. In this article, we introduce two ensemble methods to address this problem. Both methods combine a cluster-based under-sampling technique with fusion classifiers. We evaluate the ensemble methods on a dataset from the Database of Interacting Proteins (DIP) using 10-fold cross-validation. All the prediction models achieve an area under the receiver operating characteristic curve (AUC) of about 95%. Our results show that the ensemble classifiers are quite effective in predicting PPIs, and they also yield some useful conclusions about the performance of ensemble methods for PPI prediction on imbalanced data. The prediction software and all datasets used in this work are freely available at http://cic.scu.edu.cn/bioinformatics/Ensemble_PPIs/index.html.
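
The abstract does not spell out how the cluster-based under-sampling is performed, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes plain k-means clustering of the majority (non-interacting) class and proportional sampling from each cluster; the function names `kmeans` and `cluster_undersample`, the parameter `k`, and the toy 2-D features are all hypothetical.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on lists of floats; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers: k distinct input points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Move each center to the mean of its members; keep it if the cluster is empty.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def cluster_undersample(majority, n_keep, k=3, seed=0):
    """Keep roughly n_keep majority samples, drawn from each cluster in proportion to its size.

    Assumption: proportional allocation; other cluster-based schemes weight clusters differently.
    """
    rng = random.Random(seed)
    labels = kmeans(majority, k, seed=seed)
    kept = []
    for c in range(k):
        members = [p for p, lab in zip(majority, labels) if lab == c]
        quota = round(n_keep * len(members) / len(majority))
        kept.extend(rng.sample(members, min(quota, len(members))))
    return kept

# Toy data (made up for illustration): 300 non-interacting vs. 30 interacting pairs.
data_rng = random.Random(1)
non_interacting = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(300)]
interacting = [[data_rng.gauss(3, 1), data_rng.gauss(3, 1)] for _ in range(30)]

balanced_negatives = cluster_undersample(non_interacting, n_keep=len(interacting))
print(len(balanced_negatives), len(interacting))
```

Because each cluster's quota is rounded independently, the number of retained negatives can differ from `n_keep` by a sample or two. Repeating the call with different `seed` values would give several roughly balanced training sets, each of which could train one member of an ensemble.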