Racing Committees for Large Datasets

Authors:
Eibe Frank;Geoffrey Holmes;Richard Kirkby;Mark Hall
Venue:
DS '02 Proceedings of the 5th International Conference on Discovery Science
Year:
2002

Abstract

This paper proposes a method for generating classifiers from large datasets by building a committee of simple base classifiers using a standard boosting algorithm. It permits the processing of large datasets even if the underlying base learning algorithm cannot efficiently do so. The basic idea is to split incoming data into chunks and build a committee from classifiers trained on these individual chunks. Our method extends earlier work by adaptively pruning the committee. This is essential when applying the algorithm in practice because it dramatically reduces the algorithm's running time and memory consumption. It also makes it possible to efficiently "race" committees corresponding to different chunk sizes. This is important because our empirical results show that the accuracy of the resulting committee can vary significantly with the chunk size. The results also show that pruning is indeed crucial to making the method practical for large datasets in terms of running time and memory requirements. Surprisingly, the results demonstrate that pruning can also improve accuracy.
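
The chunk-and-prune idea described in the abstract can be illustrated with a small Python sketch. It is not the authors' algorithm: the decision-stump base learner, the unweighted majority vote, the pruning rule (keep a new member only if it improves accuracy on held-out validation data), the sequential "race" over chunk sizes, and all function names are assumptions made for illustration. Boosting-style reweighting is omitted entirely.

```python
import random

def train_stump(chunk):
    """Fit a one-feature threshold classifier (decision stump) on one chunk."""
    best = None
    for f in range(len(chunk[0][0])):
        for x, _ in chunk:
            t = x[f]
            for sign in (1, -1):
                errors = sum(1 for xi, yi in chunk
                             if (1 if sign * (xi[f] - t) > 0 else 0) != yi)
                if best is None or errors < best[0]:
                    best = (errors, f, t, sign)
    _, f, t, sign = best
    return lambda x: 1 if sign * (x[f] - t) > 0 else 0

def vote(committee, x):
    """Unweighted majority vote; ties and an empty committee predict 0."""
    return 1 if 2 * sum(member(x) for member in committee) > len(committee) else 0

def accuracy(committee, data):
    return sum(vote(committee, x) == y for x, y in data) / len(data)

def build_committee(stream, chunk_size, validation):
    """Train one stump per chunk; prune it unless validation accuracy improves."""
    committee, score = [], accuracy([], validation)
    for start in range(0, len(stream) - chunk_size + 1, chunk_size):
        candidate = committee + [train_stump(stream[start:start + chunk_size])]
        candidate_score = accuracy(candidate, validation)
        if candidate_score > score:  # assumed pruning rule, not the paper's
            committee, score = candidate, candidate_score
    return committee, score

def race(stream, validation, chunk_sizes):
    """Build one committee per chunk size and return the best on validation data."""
    results = [(size,) + build_committee(stream, size, validation)
               for size in chunk_sizes]
    return max(results, key=lambda r: r[2])  # (chunk_size, committee, score)

# Hypothetical toy data: the label is 1 when the first feature exceeds 0.5.
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(1200)]
data = [(p, 1 if p[0] > 0.5 else 0) for p in points]
train, validation = data[:1000], data[1000:]

best_size, committee, score = race(train, validation, [25, 50, 100])
print(best_size, len(committee), round(score, 3))
```

Because a member is kept only when it raises validation accuracy, the committee stays much smaller than the number of chunks, which is the source of the time and memory savings the abstract attributes to pruning. A true race would grow all committees concurrently over the incoming stream; the loop above processes chunk sizes one after another only to keep the sketch short.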