Making class bias useful: a strategy of learning from imbalanced data

Authors:
Jie Gu;Yuanbing Zhou;Xianqiang Zuo
Affiliations:
Software School of Tsinghua University;State Power Economic Research Institute, China;State Power Economic Research Institute, China
Venue:
IDEAL'07 Proceedings of the 8th international conference on Intelligent data engineering and automated learning
Year:
2007

Citing 7
Cited 2

Random Forests

Machine Learning
A Comparative Study of Cost-Sensitive Boosting Algorithms

ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning
Cost-Sensitive Learning by Cost-Proportionate Example Weighting

ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
Towards systematic design of distance functions for data mining applications

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Editorial: special issue on learning from imbalanced data sets

ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets
SMOTE: synthetic minority over-sampling technique

Journal of Artificial Intelligence Research
The use of the area under the ROC curve in the evaluation of machine learning algorithms

Pattern Recognition

Ecological inference in empirical software engineering

ASE '11 Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering
Prediction of liquefaction potential based on CPT up-sampling

Computers & Geosciences

Quantified Score

Hi-index	0.00

Visualization

Abstract

The performance of many learning methods are usually influenced by the class imbalance problem, where the training data is dominated by the instances belonging to one class. In this paper, we propose a novel method which combines random forest based techniques and sampling methods for effectively learning from imbalanced data. Our method is mainly composed of two phases: data cleaning and classification based on random forest. Firstly, the training data is cleaned through the elimination of dangerous negative instances. The data cleaning process is supervised by a negative biased random forest, where the negative instances have a major proportion of the training data in each of the tree in the forest. Secondly, we develop a variant of random forest in which each tree is biased towards the positive class to classify the data set, where a major vote is provided for prediction. In the experimental test, we compared our method with other existing methods on the real data sets, and the results demonstrate the significative performance improvement of our method in terms of the area under the ROC curve(AUC).