Learning linear and kernel predictors with the 0-1 loss function

  • Authors:
  • Shai Shalev-Shwartz; Ohad Shamir; Karthik Sridharan

  • Affiliations:
  • The Hebrew University; Microsoft Research and The Hebrew University; Toyota Technological Institute

  • Venue:
  • IJCAI'11: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three
  • Year:
  • 2011

Abstract

Some of the most successful machine learning algorithms, such as Support Vector Machines, are based on learning linear and kernel predictors with respect to a convex loss function, such as the hinge loss. For classification purposes, a more natural loss function is the 0-1 loss. However, using it leads to a non-convex problem for which there is no known efficient algorithm. In this paper, we describe and analyze a new algorithm for learning linear or kernel predictors with respect to the 0-1 loss function. The algorithm is parameterized by L, which quantifies the effective width of the region around the decision boundary in which the predictor may be uncertain. We show that without any distributional assumptions, and for any fixed L, the algorithm runs in polynomial time and learns a classifier whose 0-1 error exceeds that of the optimal such classifier by at most ε. We also prove a hardness result, showing that under a certain cryptographic assumption, no algorithm can learn such classifiers in time polynomial in L.
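
To make the quantities in the abstract concrete, here is a minimal Python sketch, not the authors' algorithm, that tabulates the non-convex 0-1 loss, the convex hinge surrogate used by SVMs, and one possible L-Lipschitz sigmoidal approximation of the 0-1 loss; the function names and the specific sigmoid are illustrative assumptions.

```python
# Illustrative sketch only (NOT the paper's algorithm): it tabulates the
# losses the abstract contrasts. The particular sigmoid below is one
# assumed choice of an L-Lipschitz transfer function.

import numpy as np

def zero_one_loss(margin):
    # Non-convex 0-1 loss on the margin y * <w, x>: 1 if misclassified.
    return (margin <= 0).astype(float)

def hinge_loss(margin):
    # Convex surrogate used by SVMs: max(0, 1 - margin).
    return np.maximum(0.0, 1.0 - margin)

def sigmoid_transfer(margin, L):
    # An L-Lipschitz sigmoidal approximation of the 0-1 loss (its derivative
    # magnitude is bounded by L). Larger L gives a sharper transition, i.e.
    # a narrower band of uncertainty around the decision boundary.
    return 1.0 / (1.0 + np.exp(4.0 * L * margin))

if __name__ == "__main__":
    margins = np.linspace(-1.0, 1.0, 9)
    print("margins:      ", np.round(margins, 2))
    print("0-1 loss:     ", zero_one_loss(margins))
    print("hinge loss:   ", np.round(hinge_loss(margins), 2))
    for L in (1.0, 10.0):
        print(f"sigmoid L={L:>4}:", np.round(sigmoid_transfer(margins, L), 3))
```

Running this shows that with L = 1 the sigmoidal loss decays gently across the whole margin range, while with L = 10 it is already near 0 or 1 outside a narrow band around zero margin, mirroring the role L plays in the abstract: larger L approximates the 0-1 loss more closely, at the cost (by the hardness result) of super-polynomial running time in L.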