Avoiding Boosting Overfitting by Removing Confusing Samples

  • Authors:
  • Alexander Vezhnevets; Olga Barinova

  • Affiliations:
  • Moscow State University, Dept. of Computational Mathematics and Cybernetics, Graphics and Media Lab, 119992 Moscow, Russia (both authors)

  • Venue:
  • ECML '07: Proceedings of the 18th European Conference on Machine Learning
  • Year:
  • 2007


Abstract

Boosting methods are known to exhibit noticeable overfitting on some datasets while being immune to overfitting on others. In this paper we show that standard boosting algorithms are not appropriate in the case of overlapping classes. This inadequacy is likely to be the major source of boosting overfitting on real-world data. To verify our conclusion, we use the fact that any task with overlapping classes can be reduced to a deterministic task with the same Bayesian separating surface. This can be done by removing "confusing samples" --- samples that are misclassified by a "perfect" Bayesian classifier. We propose an algorithm for removing confusing samples and experimentally study the behavior of AdaBoost trained on the resulting data sets. Experiments confirm that removing confusing samples helps boosting reduce the generalization error and avoid overfitting on both synthetic and real-world data sets. The process of removing confusing samples also provides an accurate error prediction based on the training sets alone.
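The idea sketched in the abstract can be illustrated with a rough proxy. The true Bayesian classifier is unknown in practice, so the sketch below (not the authors' exact algorithm) approximates it with out-of-fold predictions of a strong ensemble: samples the proxy misclassifies are treated as "confusing" and dropped, and AdaBoost is then trained on the cleaned set. All model choices (random forest proxy, fold count, estimator counts) are illustrative assumptions.

```python
# Hedged sketch: approximate the unknown "perfect" Bayesian classifier with
# out-of-fold predictions, remove the samples it misclassifies, and compare
# AdaBoost trained on raw vs. cleaned data. Illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_predict, train_test_split

# flip_y injects label noise, simulating overlapping classes.
X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.15,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Out-of-fold predictions of an ensemble stand in for the Bayes classifier.
proxy = RandomForestClassifier(n_estimators=200, random_state=0)
oof = cross_val_predict(proxy, X_tr, y_tr, cv=5)
keep = oof == y_tr  # drop "confusing samples" the proxy gets wrong

acc_raw = AdaBoostClassifier(n_estimators=200, random_state=0) \
    .fit(X_tr, y_tr).score(X_te, y_te)
acc_clean = AdaBoostClassifier(n_estimators=200, random_state=0) \
    .fit(X_tr[keep], y_tr[keep]).score(X_te, y_te)
print(f"AdaBoost on raw data:     {acc_raw:.3f}")
print(f"AdaBoost on cleaned data: {acc_clean:.3f}")
```

With noisy labels, the cleaned training set is typically smaller but less contradictory, which is the mechanism the paper argues protects AdaBoost from fitting the overlap region.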