Core-generating approximate minimum entropy discretization for rough set feature selection in pattern classification

  • Authors:
  • David Tian;Xiao-jun Zeng;John Keane

  • Affiliations:
  • Department of Computing, Faculty of ACES, Sheffield Hallam University, Howard Street, Sheffield S1 1WB, UK and School of Computer Science, University of Manchester, Oxford Road, Manchester M13 9PL ...;School of Computer Science, University of Manchester, Oxford Road, Manchester M13 9PL, UK;School of Computer Science, University of Manchester, Oxford Road, Manchester M13 9PL, UK

  • Venue:
  • International Journal of Approximate Reasoning
  • Year:
  • 2011

Abstract

Rough set feature selection (RSFS) can be used to improve classifier performance. RSFS removes redundant attributes whilst retaining the important ones that preserve the classification power of the original dataset. Reducts are the feature subsets selected by RSFS, and the core is the intersection of all the reducts of a dataset. RSFS can only handle discrete attributes; hence, continuous attributes need to be discretized before being input to RSFS. Discretization determines the core size of a discrete dataset. However, current discretization methods do not consider the core size during discretization. Earlier work proposed the core-generating approximate minimum entropy discretization (C-GAME) algorithm, which selects the maximum number of minimum entropy cuts capable of generating a non-empty core within a discrete dataset. The contributions of this paper are as follows: (1) the C-GAME algorithm is improved by adding a new type of constraint that eliminates the possibility that only a single reduct is present in a C-GAME-discrete dataset; (2) the performance of C-GAME is evaluated in comparison to C4.5, multi-layer perceptron, RBF network and k-nearest neighbour classifiers on ten datasets chosen from the UCI Machine Learning Repository; (3) the performance of C-GAME is evaluated in comparison to the Recursive Minimum Entropy Partition (RMEP), ChiMerge, Boolean Reasoning and Equal Frequency discretization algorithms on the same ten datasets; (4) the effects of C-GAME and the other four discretization methods on the sizes of reducts are evaluated; (5) an upper bound is defined on the total number of reducts within a dataset; (6) the effects of different discretization algorithms on the total number of reducts are analysed; (7) the performance of two RSFS algorithms (a genetic algorithm and Johnson's algorithm) is analysed.
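
To make the terms "reduct" and "core" concrete, the following is a minimal illustrative sketch and not the C-GAME algorithm or either of the RSFS algorithms evaluated in the paper. It assumes a small, consistent, already-discrete decision table and finds reducts by brute force, treating a reduct as a minimal attribute subset under which objects with equal attribute values always share the same decision. The table `TABLE` and the function names `is_consistent`, `reducts` and `core` are hypothetical and chosen only for this example.

```python
from itertools import combinations

# Hypothetical toy decision table: (a0, a1, a2, decision).
# Assumed to be discrete and consistent on the full attribute set.
TABLE = [
    (0, 0, 0, 0),
    (0, 1, 1, 1),
    (1, 0, 0, 1),
    (1, 1, 1, 0),
]


def is_consistent(rows, attrs):
    """True if objects that agree on `attrs` always share the same decision."""
    seen = {}
    for *conds, decision in rows:
        key = tuple(conds[a] for a in attrs)
        if seen.setdefault(key, decision) != decision:
            return False
    return True


def reducts(rows, n_attrs):
    """Brute-force search for all minimal consistency-preserving subsets."""
    found = []
    for size in range(1, n_attrs + 1):
        for subset in combinations(range(n_attrs), size):
            # Skip supersets of an already-found reduct: they are not minimal.
            if any(set(r) <= set(subset) for r in found):
                continue
            if is_consistent(rows, subset):
                found.append(subset)
    return found


def core(all_reducts):
    """The core is the intersection of all reducts (empty if there are none)."""
    if not all_reducts:
        return set()
    return set.intersection(*(set(r) for r in all_reducts))


all_reducts = reducts(TABLE, 3)
print(all_reducts)        # [(0, 1), (0, 2)]
print(core(all_reducts))  # {0}
```

In this toy table the attributes a1 and a2 are interchangeable, so two reducts exist and the core contains only a0. Exhaustive search is exponential in the number of attributes, which is why practical RSFS relies on heuristics such as the genetic algorithm and Johnson's algorithm mentioned in the abstract.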