Probabilistic classifiers are developed by assuming generative models that are product distributions over the original attribute space (as in naive Bayes) or over more involved spaces (as in general Bayesian networks). Although this paradigm has proven experimentally successful on real-world applications despite its vastly simplified probabilistic assumptions, the question of why these approaches work has remained open. This paper resolves that question. We show that almost all joint distributions with a given set of marginals (that is, all distributions that could have given rise to the learned classifier), or, equivalently, almost all data sets that yield this set of marginals, are very close (in terms of distributional distance) to the product distribution on the marginals; the number of such distributions decreases exponentially with their distance from the product distribution. Consequently, as we show, for almost all joint distributions with this set of marginals, the penalty incurred in using the product distribution rather than the true joint distribution is small. In addition to resolving the puzzle surrounding the success of probabilistic classifiers, our results contribute to understanding the tradeoffs involved in developing probabilistic classifiers and should help in developing better ones.
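As a purely illustrative numerical sketch (not taken from the paper, and with assumed choices throughout: n = 6 binary attributes, Dirichlet-uniform sampling over the simplex, and KL divergence as the distributional distance), one can draw random joint distributions and measure how far each lies from the product of its own marginals. In the spirit of the counting argument above, the fraction of joints far from the product distribution should fall off sharply with distance.

```python
# Minimal empirical sketch (not the paper's proof). We draw joint
# distributions uniformly from the probability simplex over n binary
# attributes and measure, in KL divergence, how far each joint is from
# the product of its own single-attribute marginals. The attribute
# count n, trial count, and thresholds are arbitrary illustration choices.
import numpy as np

rng = np.random.default_rng(0)

def kl_to_product_of_marginals(p, n):
    """KL(p || q), where q is the product of p's single-attribute marginals."""
    p = p.reshape((2,) * n)
    q = np.ones_like(p)
    for axis in range(n):
        # Marginal of attribute `axis`: sum p over all other axes.
        marg = p.sum(axis=tuple(a for a in range(n) if a != axis))
        shape = [1] * n
        shape[axis] = 2
        q = q * marg.reshape(shape)
    p, q = p.ravel(), q.ravel()
    mask = p > 0  # 0 * log 0 contributes nothing to the KL sum
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

n, trials = 6, 5000
# Dirichlet(1, ..., 1) is the uniform distribution on the simplex.
joints = rng.dirichlet(np.ones(2 ** n), size=trials)
dists = np.array([kl_to_product_of_marginals(p, n) for p in joints])

for t in (0.25, 0.5, 1.0, 1.5, 2.0):
    print(f"fraction of joints with KL > {t:.2f}: {np.mean(dists > t):.4f}")
```

On such a run one expects the printed fractions to drop steeply as the threshold grows, mirroring the exponential decay in the number of joint distributions at a given distance from the product distribution that the abstract describes.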