FAST: a ROC-based feature selection metric for small samples and imbalanced data classification problems

Authors:
Xue-wen Chen;Michael Wasikowski
Affiliations:
The University of Kansas, Lawrence, KS, USA;The University of Kansas, Lawrence, KS, USA
Venue:
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2008

Abstract

The class imbalance problem is encountered in a large number of practical applications of machine learning and data mining, for example, information retrieval and filtering, and the detection of credit card fraud. It is widely recognized that class imbalance raises issues that are either nonexistent or less severe in balanced-class problems, and that it often results in suboptimal classifier performance. The problem is even more pronounced when the imbalanced data are also high dimensional. In such cases, feature selection methods are critical to achieving optimal performance. In this paper, we propose a new feature selection method, Feature Assessment by Sliding Thresholds (FAST), which is based on the area under a ROC curve generated by moving the decision boundary of a single-feature classifier, with thresholds placed using an even-bin distribution. FAST is compared to two commonly used feature selection methods, correlation coefficient and RELevance In Estimating Features (RELIEF), for imbalanced data classification. The experimental results obtained on text mining, mass spectrometry, and microarray data sets showed that the proposed method outperformed both RELIEF and correlation methods on skewed data sets and was comparable on balanced data sets; when only a small number of features is preferred, the proposed method achieved significantly better classification performance than the correlation and RELIEF-based methods.
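The scoring idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that "even-bin" thresholds means sorting a feature's values, splitting them into bins of roughly equal sample count, and using each bin's mean as a threshold. It also assumes that a feature's score is the larger of the AUC and 1 - AUC, so that features predictive in either direction rank highly. The names `fast_score` and `rank_features` and the default `n_bins=10` are hypothetical.

```python
import numpy as np

def fast_score(x, y, n_bins=10):
    """Approximate ROC AUC of a one-feature threshold classifier (illustrative)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=bool)  # True marks the positive (minority) class

    # Assumed even-bin placement: equal-count bins over the sorted values,
    # with one threshold at the mean of each bin.
    bins = np.array_split(np.sort(x), n_bins)
    thresholds = np.array([b.mean() for b in bins if b.size > 0])

    # Slide the decision boundary: predict positive when x > threshold.
    pred = x[None, :] > thresholds[:, None]
    tpr = (pred & y).sum(axis=1) / y.sum()
    fpr = (pred & ~y).sum(axis=1) / (~y).sum()

    # Add the ROC endpoints and order the points by FPR, then TPR.
    fpr = np.concatenate(([0.0], fpr, [1.0]))
    tpr = np.concatenate(([0.0], tpr, [1.0]))
    order = np.lexsort((tpr, fpr))
    fpr, tpr = fpr[order], tpr[order]

    # Trapezoidal area under the piecewise-linear ROC curve.
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

    # Assumption: a feature that separates classes in reverse is equally useful.
    return max(auc, 1.0 - auc)

def rank_features(X, y, n_bins=10):
    """Return feature indices sorted from highest to lowest score, plus the scores."""
    X = np.asarray(X, dtype=float)
    scores = np.array([fast_score(X[:, j], y, n_bins) for j in range(X.shape[1])])
    return np.argsort(-scores), scores
```

Calling `rank_features(X, y)` on a samples-by-features matrix `X` with binary labels `y` returns the ranking and the per-feature scores; keeping the first few indices gives a small feature subset. Because the thresholds follow the sample distribution rather than an even grid over the value range, more ROC points fall where the data are dense, which is the motivation the abstract gives for the even-bin placement.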