Comments on supervised feature selection by clustering using conditional mutual information-based distances

Authors:
Nguyen X. Vinh; James Bailey
Affiliations:
The University of New South Wales, Kensington, Sydney 2052, Australia; The University of Melbourne, Melbourne, Australia
Venue:
Pattern Recognition
Year:
2013

Citing 7
Cited 0

Bayesian Network Classifiers. Machine Learning - Special issue on learning with probabilistic representations
Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence
Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing)
Gait feature subset selection by mutual information. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans - Special section: Best papers from the 2007 biometrics: Theory, applications, and systems (BTAS 07) conference
Supervised feature selection by clustering using conditional mutual information-based distances. Pattern Recognition
Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning
Feature selection in regression tasks using conditional mutual information. IbPRIA'11 Proceedings of the 5th Iberian conference on Pattern recognition and image analysis

Abstract

Supervised feature selection is an important problem in pattern recognition. Of the many methods introduced, those based on mutual information and conditional mutual information are among the most widely adopted. In this paper, we re-analyze an interesting paper on this topic recently published by Sotoca and Pla (Pattern Recognition, Vol. 43, Issue 6, June 2010, pp. 2068-2081). That work proposes a method for supervised feature selection that clusters the features into groups using a distance measure based on conditional mutual information. The clustering procedure minimizes an objective function named the minimal relevant redundancy (mRR) criterion. Sotoca and Pla propose that this objective function is an upper bound on the information loss incurred when the full set of features is replaced by a smaller subset. We have found that their proof of this proposition rests on certain erroneous assumptions, and that the proposition itself is not true in general. To remedy the reported work, we characterize the specific conditions under which the assumptions used in the proof, and hence the proposition, hold true. We find that there is a reasonable such condition: when all features are independent given the class variable (as assumed by the popular naive Bayes classifier), the assumptions required by Sotoca and Pla's framework hold.
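The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not taken from either paper: the function names, the toy XOR data set, and the plug-in (empirical) entropy estimator are all assumptions made for illustration, and the mRR criterion itself is not implemented. The sketch computes the information loss from keeping only one of two features, checks that it equals the conditional mutual information I(X2;C|X1) as the chain rule for mutual information requires, and tests whether the two features are independent given the class (the naive Bayes condition).

```python
import math
from collections import Counter
from itertools import product


def entropy(samples):
    """Empirical Shannon entropy (in bits) of a list of hashable outcomes."""
    n = len(samples)
    return -sum((k / n) * math.log2(k / n) for k in Counter(samples).values())


def mutual_information(x, c):
    """I(X;C) = H(X) + H(C) - H(X,C), estimated from paired samples."""
    return entropy(x) + entropy(c) - entropy(list(zip(x, c)))


def conditional_mutual_information(x, c, z):
    """I(X;C|Z) = H(X,Z) + H(C,Z) - H(X,C,Z) - H(Z)."""
    return (entropy(list(zip(x, z))) + entropy(list(zip(c, z)))
            - entropy(list(zip(x, c, z))) - entropy(z))


# Hypothetical toy data: the class is the XOR of two binary features.
rows = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)]
x1 = [r[0] for r in rows]
x2 = [r[1] for r in rows]
c = [r[2] for r in rows]

full = mutual_information(list(zip(x1, x2)), c)  # I(X1,X2;C)
kept = mutual_information(x1, c)                 # I(X1;C)
loss = full - kept                               # information lost by dropping X2
cmi = conditional_mutual_information(x2, c, x1)  # I(X2;C|X1)

# I(X1;X2|C) is zero exactly when X1 and X2 are independent given C.
dep_given_class = conditional_mutual_information(x1, x2, c)

print(f"loss = {loss:.3f} bits, I(X2;C|X1) = {cmi:.3f} bits")
print(f"I(X1;X2|C) = {dep_given_class:.3f} bits")
```

In this toy data the features are strongly dependent given the class, so it falls outside the conditional-independence condition identified in the abstract. The sketch makes no claim about how the mRR bound behaves on it.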