Ensemble Approach for the Classification of Imbalanced Data

Authors:
Vladimir Nikulin; Geoffrey J. McLachlan; Shu Kay Ng
Affiliations:
Department of Mathematics, University of Queensland; Department of Mathematics, University of Queensland; School of Medicine, Griffith University
Venue:
AI '09 Proceedings of the 22nd Australasian Joint Conference on Advances in Artificial Intelligence
Year:
2009

Abstract

Ensembles often achieve greater prediction accuracy than any of their individual members. Because of the diversity among the individual base-learners, an ensemble is less likely to suffer from overfitting. In many cases, however, the data are imbalanced, and a classifier built using all of the data tends to ignore the minority class. As a solution, we propose to consider a large number of relatively small, balanced subsets in which representatives of the larger class are selected randomly. As an outcome, the system produces a matrix of linear regression coefficients whose rows represent random subsets and whose columns represent features. Based on this matrix, we assess how stable the influence of each feature is, and we propose to keep in the model only features with stable influence. The final model is an average of the base-learners, which are not necessarily linear regression models. Test results on datasets from the PAKDD-2007 data-mining competition are presented.
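
The procedure outlined in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: every balanced subset keeps all minority samples (label 1) and draws an equal number of majority samples (label 0) without replacement; stability is scored as |mean| / std of each coefficient column with an arbitrary threshold; and the base-learners in the final average are plain least-squares regressions, although the abstract notes they need not be. The function names fit_balanced_subsets, stable_features, and average_prediction are hypothetical.

```python
import numpy as np


def fit_balanced_subsets(X, y, n_subsets=100, seed=0):
    """Fit least-squares regressions on many balanced random subsets.

    Returns a matrix with one row per subset. Columns hold the feature
    coefficients, followed by the intercept in the last column.
    """
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)  # assumption: label 1 is the minority class
    majority = np.flatnonzero(y == 0)
    W = np.empty((n_subsets, X.shape[1] + 1))
    for k in range(n_subsets):
        # Assumption: keep every minority sample and draw an equally sized
        # random sample of the majority class, so each subset is balanced.
        drawn = rng.choice(majority, size=minority.size, replace=False)
        idx = np.concatenate([minority, drawn])
        A = np.column_stack([X[idx], np.ones(idx.size)])  # append intercept column
        w, *_ = np.linalg.lstsq(A, y[idx].astype(float), rcond=None)
        W[k] = w
    return W


def stable_features(coefs, threshold=2.0):
    """Return indices of features whose coefficients vary little across subsets.

    The score |mean| / std and the threshold are illustrative assumptions;
    the abstract does not state which stability measure is used.
    """
    mean = coefs.mean(axis=0)
    std = coefs.std(axis=0) + 1e-12  # guard against division by zero
    return np.flatnonzero(np.abs(mean) / std > threshold)


def average_prediction(W, X_new):
    """Average the outputs of all linear base-learners stored in W."""
    return X_new @ W[:, :-1].mean(axis=0) + W[:, -1].mean()
```

Under these assumptions, a typical use would call fit_balanced_subsets on all features, pass the coefficient columns (without the intercept) to stable_features, refit on the selected columns only, and score new data with average_prediction.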