Fano's inequality lower bounds the probability of transmission error through a communication channel. Applied to classification, it yields a lower bound on the Bayes error rate and motivates the widely used Infomax principle. In modern machine learning, however, we are often interested in more than the error rate. In medical diagnosis, for example, different errors incur different costs, so the overall risk is cost-sensitive. Two other popular criteria are the balanced error rate (BER) and the F-score. In this work, we focus on the two-class problem and use a general definition of conditional entropy (which includes Shannon's as a special case) to derive upper and lower bounds on the optimal F-score, BER, and cost-sensitive risk, extending Fano's result. As a consequence, we show that Infomax is not suited to optimizing F-score or cost-sensitive risk, in that it can lead to low F-score and high risk. For cost-sensitive risk, we propose a new conditional entropy formulation that avoids this inconsistency. In addition, we consider the common practice of tuning a classifier by thresholding the posterior probability. As is widely known, a threshold of 0.5, where the posteriors cross, minimizes the error rate; we derive analogous optimal thresholds for F-score and BER.
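For concreteness, recall the binary form of Fano's inequality that the abstract builds on: with P_e the Bayes error and h the binary entropy function, H(Y|X) <= h(P_e), hence P_e >= h^{-1}(H(Y|X)) on [0, 1/2]. The sketch below is illustrative only, not code from the paper; the synthetic data and function names are hypothetical. It assumes calibrated posteriors (labels drawn as y ~ Bernoulli(p)) and sweeps a decision threshold t, showing empirically that the error-minimizing threshold sits near 0.5 while the BER- and F-optimal thresholds move away from 0.5 under class imbalance.

```python
# Hypothetical illustration (not from the paper): sweep a decision threshold t
# over calibrated posteriors and compare error rate, balanced error rate (BER),
# and F-score. With calibrated posteriors, t = 0.5 should empirically minimize
# the error rate; the BER and F-score optima generally land elsewhere.
import numpy as np

def metrics_at_threshold(posteriors, labels, t):
    """Return (error rate, BER, F-score) when predicting y=1 iff posterior > t."""
    preds = (posteriors > t).astype(int)
    tp = np.sum((preds == 1) & (labels == 1))
    fp = np.sum((preds == 1) & (labels == 0))
    fn = np.sum((preds == 0) & (labels == 1))
    tn = np.sum((preds == 0) & (labels == 0))
    error = (fp + fn) / len(labels)
    ber = 0.5 * (fn / max(tp + fn, 1) + fp / max(fp + tn, 1))
    fscore = 2 * tp / max(2 * tp + fp + fn, 1)
    return error, ber, fscore

rng = np.random.default_rng(0)
# Draw skewed posteriors p = P(y=1|x), then sample labels y ~ Bernoulli(p),
# so the posteriors are exactly calibrated and the positive class is rare.
p = rng.beta(1.0, 3.0, size=200_000)
labels = (rng.random(p.size) < p).astype(int)

ts = np.linspace(0.01, 0.99, 99)
results = np.array([metrics_at_threshold(p, labels, t) for t in ts])
print("t minimizing error:", ts[results[:, 0].argmin()])  # expect ~0.5
print("t minimizing BER:  ", ts[results[:, 1].argmin()])
print("t maximizing F:    ", ts[results[:, 2].argmax()])
```

On skewed data of this kind the three optimal thresholds separate, which is the abstract's point: error rate, BER, and F-score are genuinely different objectives, and a criterion (or threshold) chosen for one can be poor for the others.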