Feature selection strategies for poorly correlated data: correlation coefficient considered harmful

  • Authors:
  • Silang Luo; David Corne

  • Affiliations:
  • School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh, United Kingdom (both authors)

  • Venue:
  • AIKED'08 Proceedings of the 7th WSEAS International Conference on Artificial intelligence, knowledge engineering and data bases
  • Year:
  • 2008

Abstract

Feature selection is often an essential pre-processing step when data mining is applied to many-attribute datasets (e.g. several hundred or thousands of attributes). Feature selection aims to pre-select a relatively small number of attributes, thus speeding up further processing and (hopefully) eliminating data that have minimal or no discriminatory power. Often, feature selection is done on the basis of straightforward statistical correlation, discarding the features that have the lowest correlation with the target class(es). However, when these correlation values are rather low for all features (common in many datasets of importance), the basis for pre-selecting any specific set of features is undermined, and straightforward feature selection may do more harm than good. We confirm this by investigating the performance of five feature selection strategies on several datasets with varying overall correlation values, finding that statistical correlation is never the best choice for poorly correlated data. The most reliable methods among those tested are either no feature selection, or Evolutionary Algorithm feature selection.
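To make concrete the baseline strategy the abstract critiques, here is a minimal sketch (not the authors' code) of correlation-based feature filtering: rank each feature by the absolute Pearson correlation with the target and keep the top k. All data and feature names below are hypothetical illustrations.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no discriminatory signal
    return cov / (sx * sy)

def select_by_correlation(columns, target, k):
    """Keep the k feature names most correlated (in absolute value) with the target."""
    ranked = sorted(columns,
                    key=lambda name: abs(pearson(columns[name], target)),
                    reverse=True)
    return ranked[:k]

# Toy data: 'f1' tracks the target, 'f2' is noise, 'f3' is anti-correlated
# (and therefore just as informative as 'f1' under the absolute-value ranking).
columns = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.0, 1.0, 2.0, 1.0, 2.0],
    "f3": [5.0, 4.0, 3.0, 2.0, 1.0],
}
target = [1.1, 2.0, 2.9, 4.2, 5.0]
print(select_by_correlation(columns, target, 2))  # keeps f1 and f3, drops the noise column
```

The paper's point is that when *every* feature's correlation with the target is near zero, this ranking becomes noise-driven, so the cutoff at k discards features essentially at random.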