Boosting with data generation: improving the classification of hard to learn examples

  • Authors:
  • Hongyu Guo; Herna L. Viktor

  • Affiliations:
  • School of Information Technology and Engineering, University of Ottawa, Ottawa, Ontario, Canada (both authors)

  • Venue:
  • IEA/AIE'2004: Proceedings of the 17th International Conference on Innovations in Applied Artificial Intelligence
  • Year:
  • 2004

Abstract

An ensemble of classifiers consists of a set of individually trained classifiers whose predictions are combined to classify new instances. In particular, boosting is an ensemble method in which the performance of weak classifiers is improved by focusing on "hard examples", those that are difficult to classify. Recent studies have indicated that boosting algorithms are applicable to a broad spectrum of problems with great success. However, boosting algorithms frequently suffer from overemphasizing the hard examples, leading to poor training and test set accuracies. Moreover, the knowledge acquired from such hard examples may be insufficient to improve the overall accuracy of the ensemble. This paper describes a new algorithm, DataBoost, that addresses these problems through data generation. In the DataBoost method, hard examples are identified during each iteration of the boosting algorithm. These hard examples are then used to generate synthetic training data, which are added to the original training set and used for further training. The paper reports results on ten data sets, using both decision trees and neural networks as base classifiers. The experiments show promising results in terms of the overall accuracy obtained.
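
Below is a minimal Python sketch of the general idea the abstract describes: an AdaBoost-style loop in which the examples the current weak learner misclassifies ("hard examples") are used to seed synthetic training data for subsequent rounds. The Gaussian-perturbation generator (generate_synthetic), the use of decision stumps as weak learners, the {-1, +1} label convention, and all parameter values are illustrative assumptions for this sketch; they are not the authors' actual DataBoost procedure.

```python
# Sketch of boosting with data generation (assumed details, not the paper's algorithm).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)


def generate_synthetic(X_hard, y_hard, n_new, noise=0.05):
    """Create n_new synthetic examples by perturbing randomly chosen hard
    examples with Gaussian noise (an assumed generation scheme)."""
    idx = rng.integers(0, len(X_hard), size=n_new)
    X_new = X_hard[idx] + rng.normal(scale=noise, size=(n_new, X_hard.shape[1]))
    return X_new, y_hard[idx]


def databoost_sketch(X, y, n_rounds=10):
    """Boost decision stumps, augmenting the training set with synthetic
    copies of hard examples after each round.  Labels are assumed to be
    in {-1, +1}."""
    learners, alphas = [], []
    X_aug, y_aug = X.copy(), y.copy()
    w = np.full(len(X_aug), 1.0 / len(X_aug))   # uniform initial weights

    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X_aug, y_aug, sample_weight=w)
        pred = stump.predict(X_aug)

        # Weighted error and AdaBoost-style learner weight.
        err = np.clip(np.sum(w[pred != y_aug]), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        learners.append(stump)
        alphas.append(alpha)

        # Hard examples: those the current weak learner gets wrong.
        hard = pred != y_aug
        if hard.any():
            X_new, y_new = generate_synthetic(X_aug[hard], y_aug[hard],
                                              n_new=int(hard.sum()))
            X_aug = np.vstack([X_aug, X_new])
            y_aug = np.concatenate([y_aug, y_new])
            # New examples start at the mean of the current weights.
            w = np.concatenate([w, np.full(len(X_new), w.mean())])
            pred = np.concatenate([pred, stump.predict(X_new)])

        # Standard exponential weight update, then renormalize.
        w *= np.exp(-alpha * y_aug * pred)
        w /= w.sum()

    def predict(X_test):
        # Weighted vote of the weak learners (np.sign returns 0 on exact ties).
        score = sum(a * clf.predict(X_test) for a, clf in zip(alphas, learners))
        return np.sign(score)

    return predict
```

The design choice in this sketch is simply to concatenate the generated examples onto the weighted training set before the next weak learner is fit, mirroring the abstract's description of adding synthetic data to the original training set for further training.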