Given its importance, the problem of classification on imbalanced data has attracted great attention in recent years. However, few efforts have been made to develop feature selection techniques for the classification of imbalanced data. This paper fills this critical void by introducing two approaches to feature selection for high-dimensional imbalanced data. After reviewing three traditional methods, we study and illustrate, with Bayesian learning, the challenges that imbalanced data poses to feature selection. In particular, we show that samples from the larger classes have a dominant influence on these feature selection methods, even though samples from the rare classes are essential for learning performance on those classes. Based on these observations, we propose a new feature selection approach based on class decomposition: the large classes are partitioned into relatively small pseudo-subclasses, pseudo-class labels are generated accordingly, and feature selection is then performed on the decomposed data to compute the goodness measure of each feature. In addition, we introduce a Hellinger distance-based method for feature selection. The Hellinger distance is a measure of distributional divergence that is strongly skew-insensitive, because class priors do not enter its computation. Finally, we theoretically demonstrate the effectiveness of the two approaches with Bayesian learning on synthetic data, and we test and compare the proposed feature selection methods on several real-world data sets. The experimental results show that both the decomposition-based and the Hellinger distance-based methods outperform existing feature selection methods by a clear margin on imbalanced data.
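The class-decomposition idea described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's exact procedure: the use of k-means as the partitioning method, the `max_ratio` threshold, and all function names are assumptions introduced here. Large classes are split into pseudo-subclasses whose sizes are comparable to the smallest class, and the resulting pseudo-labels replace the originals before any standard feature scorer is applied.

```python
import math
import random
from collections import Counter


def _sq_dist(p, c):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((a - b) ** 2 for a, b in zip(p, c))


def _kmeans(points, k, iters=20, seed=0):
    # Minimal k-means; returns a cluster index for every point.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: _sq_dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = [sum(col) / len(col) for col in zip(*members)]
    return assign


def decompose_large_classes(X, y, max_ratio=2.0, seed=0):
    """Split every class whose size exceeds max_ratio times the smallest
    class into pseudo-subclasses; returns pseudo-labels aligned with y.
    (Hypothetical sketch: the paper does not prescribe this exact split.)"""
    sizes = Counter(y)
    smallest = min(sizes.values())
    new_y = list(y)
    for label, size in sizes.items():
        k = math.ceil(size / (max_ratio * smallest))
        if k < 2:
            continue  # class is already small enough; keep its label
        idx = [i for i, lab in enumerate(y) if lab == label]
        clusters = _kmeans([X[i] for i in idx], k, seed=seed)
        for i, c in zip(idx, clusters):
            new_y[i] = f"{label}_{c}"  # pseudo-class label
    return new_y
```

Any conventional filter (information gain, chi-square, etc.) can then score features against `new_y`; because no pseudo-subclass dwarfs the rare class, the rare class retains influence on the scores.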
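The skew-insensitivity of the Hellinger distance can likewise be made concrete. The sketch below, a minimal binary-class version with illustrative function names, scores a discrete feature by the Hellinger distance between its two class-conditional value distributions; since each class's distribution is normalized separately, the class priors never enter, so a rare class weighs in on equal footing with the majority class.

```python
import math
from collections import Counter


def _value_dist(values):
    # Empirical distribution of a feature's values as {value: probability}.
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}


def hellinger(p, q):
    # Hellinger distance between two discrete distributions, in [0, 1]:
    # H(P, Q) = sqrt( (1/2) * sum_v (sqrt(P(v)) - sqrt(Q(v)))^2 ).
    support = set(p) | set(q)
    s = sum((math.sqrt(p.get(v, 0.0)) - math.sqrt(q.get(v, 0.0))) ** 2
            for v in support)
    return math.sqrt(s / 2.0)


def hellinger_feature_scores(X, y):
    """Score each discrete feature by the Hellinger distance between its
    class-conditional value distributions (binary-class sketch)."""
    classes = sorted(set(y))
    assert len(classes) == 2, "this sketch handles two classes"
    pos_rows = [x for x, lab in zip(X, y) if lab == classes[1]]
    neg_rows = [x for x, lab in zip(X, y) if lab == classes[0]]
    scores = []
    for j in range(len(X[0])):
        p = _value_dist(row[j] for row in pos_rows)
        q = _value_dist(row[j] for row in neg_rows)
        scores.append(hellinger(p, q))
    return scores
```

A perfectly separating feature scores 1, an uninformative one scores 0, and neither score changes if the majority class is replicated tenfold — which is exactly the skew-insensitivity the abstract appeals to.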