Reducing overfitting of AdaBoost by clustering-based pruning of hard examples

Authors: Dae-Sun Kim; Yeul-Min Baek; Whoi-Yul Kim
Affiliations: Hanyang University, Seoul, Korea (all authors)
Venue: Proceedings of the 7th International Conference on Ubiquitous Information Management and Communication
Year: 2013

References (8)

A decision-theoretic generalization of on-line learning and an application to boosting
Journal of Computer and System Sciences - Special issue: 26th annual ACM symposium on the theory of computing & STOC'94, May 23–25, 1994, and second annual European conference on computational learning theory (EuroCOLT'95), March 13–15, 1995

Improved Boosting Algorithms Using Confidence-rated Predictions
Machine Learning - The Eleventh Annual Conference on Computational Learning Theory

An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization
Machine Learning

Soft Margins for AdaBoost
Machine Learning

Class-switching neural network ensembles
Neurocomputing

Avoiding Boosting Overfitting by Removing Confusing Samples
ECML '07 Proceedings of the 18th European conference on Machine Learning

Edited AdaBoost by weighted kNN
Neurocomputing

A noise-detection based AdaBoost algorithm for mislabeled data
Pattern Recognition

Cited by (1)

A framework for selection and fusion of pattern classifiers in multimedia recognition
Pattern Recognition Letters

Abstract

To reduce overfitting in AdaBoost, we propose a novel AdaBoost algorithm that uses K-means clustering. AdaBoost is known, both theoretically and empirically, to be an effective method for improving the performance of base classifiers. However, previous studies have shown that it is prone to overfitting when classes overlap. To overcome this problem, the proposed method uses K-means clustering to remove hard-to-learn samples that lie in the overlapping region. Because these hard-to-learn samples are excluded from training, the proposed method suffers less from overfitting than conventional AdaBoost. The method was tested on both synthetic and real-world data to confirm its validity.
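
The abstract does not say how K-means is used to identify hard-to-learn samples, so the following is only a minimal sketch of one plausible pruning rule, not the authors' procedure. It assumes that a sample counts as "hard" when its label disagrees with the majority label of its K-means cluster. The function names `kmeans` and `prune_hard_examples`, the parameter `k`, and the toy data are all hypothetical.

```python
import random
from collections import Counter


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-means; returns a cluster index for every point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers drawn from the data
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return assign


def prune_hard_examples(points, labels, k, seed=0):
    """ASSUMED rule (not from the paper): keep only samples whose label
    matches the majority label of their cluster. Returns kept indices."""
    assign = kmeans(points, k, seed=seed)
    majority = {}
    for c in set(assign):
        votes = Counter(l for l, a in zip(labels, assign) if a == c)
        majority[c] = votes.most_common(1)[0][0]
    return [i for i, (l, a) in enumerate(zip(labels, assign))
            if l == majority[a]]


# Toy data: two well-separated blobs, each containing one sample that
# carries the other blob's label (indices 3 and 4).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
kept = prune_hard_examples(points, labels, k=2)
print(kept)
```

Under this assumption, the samples at the kept indices would then be passed to any standard AdaBoost implementation in place of the full training set.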