Evaluation of the importance of data pre-processing order when combining feature selection and data sampling

Authors:
Ahmad Abu Shanab;Taghi M. Khoshgoftaar;Randall Wald;Jason Van Hulse
Affiliations:
Department of Computer & Electrical Engineering & Computer Science, Florida Atlantic University, 777 Glades Road, Boca Raton, FL 33431 USA (all authors)
Venue:
International Journal of Business Intelligence and Data Mining
Year:
2012

Abstract

Two problems often encountered in machine learning are class imbalance and high dimensionality. In this paper, we compare three approaches for addressing both problems simultaneously by combining data sampling and feature selection. In the first two approaches, sampling is followed by feature selection. In the first approach, the features are selected based on the sampled data, and then the unsampled data is used, restricted to the selected features. The second approach is similar, but the sampled data is used instead. Finally, in the third approach, feature selection is performed prior to sampling. To compare the approaches, we use seven datasets from different domains, employ nine feature rankers from three different families, apply three sampling techniques, and inject class noise to better simulate real-world datasets. The results show that the second and third approaches both perform very well, with the third approach showing a slight (but not statistically significant) lead.
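
The three pre-processing orders described in the abstract can be illustrated with a minimal Python sketch. The abstract does not name the rankers or sampling techniques, so the code makes two loud assumptions: random undersampling stands in for the sampling step, and a simple difference-of-class-means score stands in for a feature ranker. All function names (`undersample`, `rank_features`, `approach_1`, and so on) are hypothetical and are not taken from the paper.

```python
import random
from statistics import mean


def undersample(X, y, seed=0):
    """Assumed sampler: randomly drop majority-class rows until classes balance."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = sorted(minority + rng.sample(majority, len(minority)))
    return [X[i] for i in keep], [y[i] for i in keep]


def rank_features(X, y, k):
    """Assumed ranker: score each feature by |mean(class 1) - mean(class 0)|
    and return the indices of the k highest-scoring features."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(mean(pos) - mean(neg)))
    top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
    return sorted(top)


def project(X, features):
    """Keep only the selected feature columns."""
    return [[row[j] for j in features] for row in X]


def approach_1(X, y, k):
    # Sample, select features on the sampled data, keep the UNSAMPLED rows.
    Xs, ys = undersample(X, y)
    feats = rank_features(Xs, ys, k)
    return project(X, feats), y


def approach_2(X, y, k):
    # Sample, select features on the sampled data, keep the SAMPLED rows.
    Xs, ys = undersample(X, y)
    feats = rank_features(Xs, ys, k)
    return project(Xs, feats), ys


def approach_3(X, y, k):
    # Select features on the original data first, then sample.
    feats = rank_features(X, y, k)
    return undersample(project(X, feats), y)


# Toy imbalanced dataset: 2 positive rows, 8 negative rows, 3 features.
X = [[float(i), float(i % 2), 1.0 if i < 2 else 0.0] for i in range(10)]
y = [1 if i < 2 else 0 for i in range(10)]

for name, fn in [("1", approach_1), ("2", approach_2), ("3", approach_3)]:
    Xt, yt = fn(X, y, k=1)
    print(f"approach {name}: {len(Xt)} rows, {len(Xt[0])} feature(s)")
```

The only difference between the three functions is where the sampled data enters the pipeline: approach 1 uses it solely to pick features, approach 2 uses it for both feature selection and the final training set, and approach 3 uses it only after the features are fixed. A classifier would then be trained on the returned rows and labels.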