Improving bagging performance through multi-algorithm ensembles

  • Authors: Jaideep Srivastava; Kuo-Wei Hsu

  • Affiliations: University of Minnesota; University of Minnesota

  • Year: 2011

Abstract

Bagging (bootstrap aggregating) is an ensemble method that first establishes a committee of classifiers and then aggregates their outcomes through majority voting. Bagging has attracted considerable research interest and has been applied in various application domains. Its advantages include an increased capability of handling small data sets, less sensitivity to noise or outliers, and a parallel structure that supports efficient implementations. However, in its present form, it has been found to be less accurate than some other ensemble methods, such as boosting. To unlock its power and expand its user base, we propose an approach that improves bagging through the use of multi-algorithm ensembles. In a multi-algorithm ensemble, multiple classification algorithms are employed. Our approach preserves the parallel structure, and hence the efficiency, of bagging, while simultaneously utilizing heterogeneous algorithms to improve its accuracy.

Since diversity plays a critical role in ensemble methods, we first study the nature of diversity and derive two new diversity measures, T-Diversity and A-Diversity: the former considers different training data, while the latter considers different classification algorithms. In addition, we formally define the twin notions of the stability of an algorithm on data sets drawn from the same underlying distribution and the heterogeneity of two algorithms on a data set. Most ensemble methods manipulate training data in the sample space and/or the feature space, ignoring the characteristics of classification algorithms. Some research utilizes heterogeneous algorithms but lacks a full explanation and is hence ad hoc. We show, both theoretically and empirically, that using heterogeneous algorithms together with different training data increases diversity in ensembles more than using different training data alone, thereby providing a fundamental explanation for the research utilizing heterogeneous algorithms. Furthermore, while diversity plays an important role in any ensemble method, its relationship to the overall accuracy of the ensemble remains ambiguous in theory. We partially address this problem by providing a non-linear function that describes the relationship between diversity and correlation. This function serves as a proxy between diversity and the overall accuracy of an ensemble, since the relationship between correlation and overall accuracy has been studied by other researchers. While our result provides an indirect connection between diversity and accuracy, a direct connection remains an open problem.

The bootstrap procedure is the exclusive source of diversity in bagging. We therefore use heterogeneity as another source of diversity and propose a framework that utilizes heterogeneous algorithms in bagging. We discuss its design as well as the plans used to select algorithms. For evaluation, we consider several benchmark data sets from various application domains. The results provide evidence that our approach outperforms classical bagging and is comparable to other state-of-the-art ensemble methods. Our approach demonstrates excellent potential for reconsidering bagging as an ensemble method of choice. Additionally, it retains the parallel structure of bagging, ensuring its efficiency while enhancing its accuracy through multi-algorithm ensembles.
In summary, we contribute to a better understanding of the role of diversity in ensemble learning, including a derivation of two types of diversity, a formalization of heterogeneity, and a proxy for the relationship between diversity and overall accuracy. Since the use of heterogeneous algorithms in ensembles has not been systematically studied before, we also provide theoretical and empirical support for multi-algorithm ensembles. These two ideas together provide an approach to a more advanced ensemble design.
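
To make the idea described in the abstract concrete, the following is a minimal illustrative sketch (not the authors' implementation) of bagging with heterogeneous base algorithms. It assumes scikit-learn-style estimators; the particular base learners (decision tree, naive Bayes, k-nearest neighbors) and the round-robin assignment of algorithms to committee members are hypothetical choices made only for illustration, not taken from the paper.

```python
# Sketch: multi-algorithm bagging.
# Each committee member is trained on a bootstrap sample (different training
# data, the source of what the abstract calls T-Diversity); members cycle over
# a pool of heterogeneous algorithms (the source of A-Diversity). Outcomes are
# aggregated by majority voting, as in classical bagging.
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier


def fit_multi_algorithm_bagging(X, y, base_estimators, n_members=30, seed=0):
    """Train an ensemble using bootstrap resampling plus heterogeneous algorithms."""
    rng = np.random.default_rng(seed)
    committee = []
    for i in range(n_members):
        # Bootstrap sample: draw len(X) rows with replacement.
        idx = rng.integers(0, len(X), size=len(X))
        # Heterogeneous algorithms: cycle through the pool (illustrative plan).
        member = clone(base_estimators[i % len(base_estimators)])
        member.fit(X[idx], y[idx])
        committee.append(member)
    return committee


def predict_majority(committee, X):
    """Aggregate member outputs by simple majority voting."""
    votes = np.array([m.predict(X) for m in committee])  # shape: (members, samples)
    return np.apply_along_axis(
        lambda col: np.bincount(col).argmax(), axis=0, arr=votes
    )


if __name__ == "__main__":
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    pool = [DecisionTreeClassifier(), GaussianNB(), KNeighborsClassifier()]
    ensemble = fit_multi_algorithm_bagging(X_tr, y_tr, pool)
    acc = (predict_majority(ensemble, X_te) == y_te).mean()
    print(f"multi-algorithm bagging accuracy: {acc:.3f}")
```

Because each member is trained independently, the training loop parallelizes directly, which reflects the property the abstract highlights: the parallel structure of bagging is retained while heterogeneous algorithms supply an additional source of diversity.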