Elements of information theory.
Wrappers for feature subset selection. Artificial Intelligence (Special Issue on Relevance).
On the Performance Assessment and Comparison of Stochastic Multiobjective Optimizers. Proceedings of the 4th International Conference on Parallel Problem Solving from Nature (PPSN IV).
Estimation of entropy and mutual information. Neural Computation.
Object Recognition with Informative Features and Linear Classification. Proceedings of the Ninth IEEE International Conference on Computer Vision (ICCV '03), Volume 2.
Learning Bayesian network classifiers by maximizing conditional likelihood. Proceedings of the Twenty-First International Conference on Machine Learning (ICML '04).
Efficient Feature Selection via Analysis of Relevance and Redundancy. Journal of Machine Learning Research.
Large-Sample Learning of Bayesian Networks is NP-Hard. Journal of Machine Learning Research.
Fast Binary Feature Selection with Conditional Mutual Information. Journal of Machine Learning Research.
Feature selection and feature extraction for text categorization. Proceedings of the Workshop on Speech and Natural Language (HLT '91).
Feature Extraction: Foundations and Applications (Studies in Fuzziness and Soft Computing).
Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research.
Stability of feature selection algorithms: a study on high-dimensional spaces. Knowledge and Information Systems.
A stability index for feature selection. Proceedings of the 25th IASTED International Multi-Conference: Artificial Intelligence and Applications (AIAP '07).
Stable feature selection via dense feature groups. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Stable and Accurate Feature Selection. Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD '09), Part I.
Gait feature subset selection by mutual information. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans.
On the Feature Selection Criterion Based on an Approximation of Multidimensional Mutual Information. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds. Journal of Machine Learning Research.
Conditional infomax learning: an integrated framework for feature extraction and fusion. Proceedings of the 9th European Conference on Computer Vision (ECCV '06), Part I.
On the use of variable complementarity for feature selection in cancer classification. Proceedings of the 2006 International Conference on Applications of Evolutionary Computing (EuroGP '06).
Input feature selection for classification problems. IEEE Transactions on Neural Networks.
Efficient feature selection filters for high-dimensional data. Pattern Recognition Letters.
Information-theoretic selection of high-dimensional spectral features for structural recognition. Computer Vision and Image Understanding.
Feature Interaction Maximisation. Pattern Recognition Letters.
We present a unifying framework for information theoretic feature selection, bringing almost two decades of research on heuristic filter criteria under a single theoretical interpretation. This is in response to the question: "what are the implicit statistical assumptions of feature selection criteria based on mutual information?". To answer this, we adopt a different strategy from the one usual in the feature selection literature: instead of trying to define a criterion, we derive one directly from a clearly specified objective function, the conditional likelihood of the training labels. While many hand-designed heuristic criteria try to optimise a definition of feature 'relevancy' and 'redundancy', our approach leads to a probabilistic framework which naturally incorporates these concepts. As a result we can unify the numerous criteria published over the last two decades, and show them to be low-order approximations to the exact (but intractable) optimisation problem. The primary contribution is to show that common heuristics for information-based feature selection (including Markov blanket algorithms as a special case) are approximate iterative maximisers of the conditional likelihood. A large empirical study provides strong evidence in favour of certain classes of criteria, in particular those that balance the relative size of the relevancy and redundancy terms. Overall we conclude that the JMI criterion (Yang and Moody, 1999; Meyer et al., 2008) provides the best tradeoff in terms of accuracy, stability, and flexibility with small data samples.
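To make the JMI criterion concrete: given an already-selected feature set S, each remaining candidate X_k is scored by J_jmi(X_k) = sum over X_j in S of I(X_k, X_j ; Y), the mutual information between the paired variable (X_k, X_j) and the class label Y, and the highest-scoring candidate is added greedily. The Python sketch below is our own illustration of this scheme, not the authors' implementation; it assumes discrete features and uses plug-in (histogram) entropy estimates, and all names (jmi_forward_select, etc.) are ours.

    import numpy as np
    from collections import Counter

    def entropy(seq):
        # Plug-in (maximum likelihood) entropy of a discrete sequence, in bits.
        seq = list(seq)
        n = len(seq)
        return -sum((c / n) * np.log2(c / n) for c in Counter(seq).values())

    def mutual_information(x, y):
        # I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from empirical counts.
        x, y = list(x), list(y)
        return entropy(x) + entropy(y) - entropy(zip(x, y))

    def jmi_score(X, y, S, k):
        # JMI score of candidate k: sum over selected j of I(X_k, X_j ; Y),
        # treating the pair (X_k, X_j) as a single joint variable.
        return sum(mutual_information(zip(X[:, k], X[:, j]), y) for j in S)

    def jmi_forward_select(X, y, num_features):
        # Greedy forward selection; the first feature is the most
        # individually relevant one, argmax_k I(X_k ; Y).
        remaining = set(range(X.shape[1]))
        first = max(remaining, key=lambda k: mutual_information(X[:, k], y))
        S, remaining = [first], remaining - {first}
        while len(S) < num_features and remaining:
            best = max(remaining, key=lambda k: jmi_score(X, y, S, k))
            S.append(best)
            remaining.remove(best)
        return S

    # Toy usage: the label depends only on features 0 and 3.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 3, size=(500, 10))
    y = (X[:, 0] + X[:, 3]) % 3
    print(jmi_forward_select(X, y, 3))

On this toy example the informative features 0 and 3 should surface early. For real data the plug-in entropy estimates degrade as sample size shrinks, which is exactly the small-sample regime where the empirical study above favours JMI over criteria that weight relevancy and redundancy unevenly.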