A Kolmogorov-Smirnov statistic based segmentation approach to learning from imbalanced datasets: With application in property refinance prediction

Authors:
Rongsheng Gong;Samuel H. Huang
Affiliations:
Intelligent Systems Laboratory, School of Dynamic Systems, University of Cincinnati, Cincinnati, OH 45221, United States;Intelligent Systems Laboratory, School of Dynamic Systems, University of Cincinnati, Cincinnati, OH 45221, United States
Venue:
Expert Systems with Applications: An International Journal
Year:
2012

Abstract

Classification is an important task in data mining. Class imbalance has been reported to hinder the performance of standard classification models. However, our study shows that class imbalance may not be the only cause of poor performance. Rather, the underlying complexity of the problem may play a more fundamental role. In this paper, a decision tree method based on the Kolmogorov-Smirnov statistic (K-S tree) is proposed to segment the training data so that a complex problem can be divided into several easier sub-problems in which class imbalance becomes less challenging. The K-S tree is also used to perform feature selection, which not only selects relevant variables but also removes redundant ones. After segmentation, a two-way re-sampling method is applied at the segment level to determine the optimal sampling percentage empirically, and the rebalanced data are used to fit logistic regression models, also at the segment level. The effectiveness of the proposed method is demonstrated through its application to property refinance prediction.
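
The abstract does not spell out how the K-S tree chooses its splits. As a rough illustration only, the sketch below assumes a binary label (1 for the minority class, 0 for the majority class) and a single numeric feature, and picks the cut point at which the empirical distribution functions of the two classes differ most, which is the standard two-sample Kolmogorov-Smirnov statistic. The function name `ks_split` and the toy data are hypothetical and are not taken from the paper.

```python
from bisect import bisect_right


def ks_split(values, labels):
    """Return (ks_statistic, threshold) for one numeric feature.

    Assumption: labels are 0/1. The K-S statistic is the largest absolute
    gap between the empirical CDFs of the feature in the two classes, and
    the threshold is the feature value at which that gap occurs.
    """
    pos = sorted(v for v, y in zip(values, labels) if y == 1)
    neg = sorted(v for v, y in zip(values, labels) if y == 0)
    if not pos or not neg:
        raise ValueError("both classes must be present")

    best_gap, best_threshold = 0.0, None
    for t in sorted(set(values)):
        # Fraction of each class with feature value <= t.
        cdf_pos = bisect_right(pos, t) / len(pos)
        cdf_neg = bisect_right(neg, t) / len(neg)
        gap = abs(cdf_pos - cdf_neg)
        if gap > best_gap:
            best_gap, best_threshold = gap, t
    return best_gap, best_threshold


# Toy data (hypothetical): the minority class is concentrated at high values.
values = [1, 2, 3, 4, 5, 6, 7, 8]
labels = [0, 0, 0, 0, 1, 0, 1, 1]
ks, threshold = ks_split(values, labels)
print(ks, threshold)
```

Records with a feature value at or below the returned threshold would form one segment and the remaining records the other. A tree could be grown by repeating the search within each segment, and the feature with the largest statistic could be treated as the most relevant one. Whether the paper's method proceeds exactly this way is not stated in the abstract.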