The class imbalance problem is encountered in many real-world applications of machine learning and often degrades a classifier's performance. Researchers have rigorously studied resampling, algorithmic, and feature selection approaches to this problem, but no systematic studies have been conducted to understand how well these methods combat class imbalance or which of them best manage the different challenges posed by imbalanced data sets. In particular, feature selection has rarely been studied outside of text classification problems. Additionally, no studies have examined the compounding problem of learning from small samples. This paper presents a first systematic comparison of the three types of methods developed for imbalanced data classification and of seven feature selection metrics, evaluated on small-sample data sets from different application domains. We evaluated the performance of these metrics using the area under the receiver operating characteristic curve (AUC) and the area under the precision-recall curve (PRC). We compared each metric on its average performance across all problems and on the likelihood of its yielding the best performance on a specific problem, examined the performance of these metrics within each problem domain, and evaluated which metrics perform best across learning algorithms. Our results showed that the signal-to-noise correlation coefficient (S2N) and Feature Assessment by Sliding Thresholds (FAST) are strong candidates for feature selection in most applications, especially when selecting very small numbers of features.
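As a minimal sketch of the kind of filter-based ranking the abstract refers to, the snippet below scores features with the standard signal-to-noise (S2N) definition, i.e. the difference of per-class means divided by the sum of per-class standard deviations, and keeps the top-k features. The function names and the binary 0/1 label convention are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def s2n_scores(X, y):
    """Signal-to-noise correlation coefficient per feature:
    (mean_pos - mean_neg) / (std_pos + std_neg).
    Features with large |score| separate the two classes well."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    pos, neg = X[y == 1], X[y == 0]
    eps = 1e-12  # guard against zero spread in a constant feature
    return (pos.mean(axis=0) - neg.mean(axis=0)) / (
        pos.std(axis=0) + neg.std(axis=0) + eps
    )

def select_top_k(X, y, k):
    """Indices of the k features with the largest absolute S2N score."""
    scores = np.abs(s2n_scores(X, y))
    return np.argsort(scores)[::-1][:k]
```

Because the score is computed independently per feature, this ranking scales well to the high-dimensional, small-sample settings the paper targets; the selected columns would then feed a classifier evaluated with AUC and PRC.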