Given its importance, the problem of classification on imbalanced data has attracted great attention in recent years. However, few efforts have been made to develop feature selection techniques for the classification of imbalanced data. This paper fills this critical void by introducing two approaches to feature selection for high-dimensional imbalanced data. After reviewing three traditional methods, we study and illustrate, with Bayesian learning, the challenges that imbalanced data poses to feature selection. In particular, we show that samples from the larger classes have a dominant influence on these feature selection methods, even though samples from the rare classes are essential for learning performance on those classes. Based on these observations, we propose a new feature selection approach based on class decomposition: the large classes are partitioned into relatively small pseudo-subclasses, pseudo-class labels are generated accordingly, and feature selection is then performed on the decomposed data to compute the goodness measure of each feature. In addition, we introduce a Hellinger distance-based method for feature selection. The Hellinger distance is a measure of distributional divergence that is strongly skew-insensitive, because class priors do not enter its computation. Finally, we theoretically demonstrate the effectiveness of the two approaches with Bayesian learning on synthetic data, and we test and compare the proposed feature selection methods on several real-world data sets. The experimental results show that both the decomposition-based and the Hellinger distance-based methods outperform existing feature selection methods by a clear margin on imbalanced data.
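The class-decomposition idea described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's exact procedure: the use of k-means as the partitioning method, the `max_ratio` threshold, and all function names are assumptions introduced here. Large classes are split into pseudo-subclasses whose sizes are comparable to the smallest class, and the resulting pseudo-labels replace the originals before any standard feature scorer is applied.

```python
import math
import random
from collections import Counter


def _sq_dist(p, c):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((a - b) ** 2 for a, b in zip(p, c))


def _kmeans(points, k, iters=20, seed=0):
    # Minimal k-means; returns a cluster index for every point.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: _sq_dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = [sum(col) / len(col) for col in zip(*members)]
    return assign


def decompose_large_classes(X, y, max_ratio=2.0, seed=0):
    """Split every class whose size exceeds max_ratio times the smallest
    class into pseudo-subclasses; returns pseudo-labels aligned with y.
    (Hypothetical sketch: the paper does not prescribe this exact split.)"""
    sizes = Counter(y)
    smallest = min(sizes.values())
    new_y = list(y)
    for label, size in sizes.items():
        k = math.ceil(size / (max_ratio * smallest))
        if k < 2:
            continue  # class is already small enough; keep its label
        idx = [i for i, lab in enumerate(y) if lab == label]
        clusters = _kmeans([X[i] for i in idx], k, seed=seed)
        for i, c in zip(idx, clusters):
            new_y[i] = f"{label}_{c}"  # pseudo-class label
    return new_y
```

Any conventional filter (information gain, chi-square, etc.) can then score features against `new_y`; because no pseudo-subclass dwarfs the rare class, the rare class retains influence on the scores.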
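The skew-insensitivity of the Hellinger distance can likewise be made concrete. The sketch below, a minimal binary-class version with illustrative function names, scores a discrete feature by the Hellinger distance between its two class-conditional value distributions; since each class's distribution is normalized separately, the class priors never enter, so a rare class weighs in on equal footing with the majority class.

```python
import math
from collections import Counter


def _value_dist(values):
    # Empirical distribution of a feature's values as {value: probability}.
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}


def hellinger(p, q):
    # Hellinger distance between two discrete distributions, in [0, 1]:
    # H(P, Q) = sqrt( (1/2) * sum_v (sqrt(P(v)) - sqrt(Q(v)))^2 ).
    support = set(p) | set(q)
    s = sum((math.sqrt(p.get(v, 0.0)) - math.sqrt(q.get(v, 0.0))) ** 2
            for v in support)
    return math.sqrt(s / 2.0)


def hellinger_feature_scores(X, y):
    """Score each discrete feature by the Hellinger distance between its
    class-conditional value distributions (binary-class sketch)."""
    classes = sorted(set(y))
    assert len(classes) == 2, "this sketch handles two classes"
    pos_rows = [x for x, lab in zip(X, y) if lab == classes[1]]
    neg_rows = [x for x, lab in zip(X, y) if lab == classes[0]]
    scores = []
    for j in range(len(X[0])):
        p = _value_dist(row[j] for row in pos_rows)
        q = _value_dist(row[j] for row in neg_rows)
        scores.append(hellinger(p, q))
    return scores
```

A perfectly separating feature scores 1, an uninformative one scores 0, and neither score changes if the majority class is replicated tenfold — which is exactly the skew-insensitivity the abstract appeals to.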