Evaluation of the importance of data pre-processing order when combining feature selection and data sampling

  • Authors:
  • Ahmad Abu Shanab; Taghi M. Khoshgoftaar; Randall Wald; Jason Van Hulse

  • Affiliations:
  • Department of Computer & Electrical Engineering & Computer Science, Florida Atlantic University, 777 Glades Road, Boca Raton, FL 33431, USA (all authors)

  • Venue:
  • International Journal of Business Intelligence and Data Mining
  • Year:
  • 2012

Abstract

Two problems often encountered in machine learning are class imbalance and high dimensionality. In this paper, we compare three approaches for addressing both problems simultaneously by combining data sampling with feature selection. In the first two approaches, sampling is performed before feature selection. In the first approach, features are selected using the sampled data, and the selected features are then applied to the original, unsampled data. The second approach is similar, but the final dataset is the sampled data restricted to the selected features. In the third approach, feature selection is performed before sampling. To compare the approaches, we use seven datasets from different domains, employ nine feature rankers from three different families, apply three sampling techniques, and inject class noise to better simulate real-world datasets. The results show that the second and third approaches both perform very well, with the third approach showing a slight (but not statistically significant) lead.
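
The three orderings described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Random undersampling stands in for the paper's three sampling techniques, and a difference-of-class-means score stands in for its nine feature rankers. All function names (`undersample`, `rank_features`, `approach_1`, and so on) and the toy data generator are hypothetical and introduced only for illustration.

```python
import random
from statistics import mean


def undersample(X, y, seed=0):
    """Random undersampling (assumed stand-in sampler): drop majority rows until balanced."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = sorted(minority + rng.sample(majority, len(minority)))
    return [X[i] for i in keep], [y[i] for i in keep]


def rank_features(X, y, k):
    """Toy filter ranker (assumed): score = |mean of class 1 - mean of class 0| per feature."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(mean(pos) - mean(neg)))
    top_k = sorted(range(n_features), key=lambda j: -scores[j])[:k]
    return sorted(top_k)  # column indices of the k best-scoring features


def project(X, cols):
    """Keep only the selected feature columns."""
    return [[row[j] for j in cols] for row in X]


def approach_1(X, y, k):
    """Sample, select features on the sampled data, then return the UNSAMPLED data."""
    Xs, ys = undersample(X, y)
    cols = rank_features(Xs, ys, k)
    return project(X, cols), y


def approach_2(X, y, k):
    """Sample, select features on the sampled data, then return the SAMPLED data."""
    Xs, ys = undersample(X, y)
    cols = rank_features(Xs, ys, k)
    return project(Xs, cols), ys


def approach_3(X, y, k):
    """Select features on the original data first, then sample the reduced data."""
    cols = rank_features(X, y, k)
    return undersample(project(X, cols), y)


def make_toy_data(n_major=20, n_minor=5, seed=1):
    """Hypothetical imbalanced dataset: feature 0 is informative, features 1 and 2 are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for label, count in ((0, n_major), (1, n_minor)):
        for _ in range(count):
            X.append([label * 3.0 + rng.gauss(0, 0.5), rng.gauss(0, 1), rng.gauss(0, 1)])
            y.append(label)
    return X, y
```

Calling `approach_1(X, y, k=2)` on the toy data returns every original row with two columns, while `approach_2` and `approach_3` return balanced subsets. The only difference between the last two is whether the ranker saw the sampled or the original class distribution, which is the comparison the paper evaluates.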