A comparison of two strategies for scaling up instance selection in huge datasets

Authors:
Aida De Haro-García;Javier Pérez-Rodríguez;Nicolás García-Pedrajas
Affiliations:
Department of Computing and Numerical Analysis, University of Córdoba, Spain;Department of Computing and Numerical Analysis, University of Córdoba, Spain;Department of Computing and Numerical Analysis, University of Córdoba, Spain
Venue:
CAEPIA'11: Proceedings of the 14th International Conference on Advances in Artificial Intelligence: Spanish Association for Artificial Intelligence
Year:
2011

Abstract

Instance selection is becoming increasingly relevant due to the huge amount of data that is constantly being produced. However, although current algorithms are useful for fairly large datasets, they run into scaling problems when the number of instances reaches hundreds of thousands or millions. Most instance selection algorithms have complexity of at least O(n^2), where n is the number of instances. For huge problems, scalability becomes an issue and most of these algorithms are not applicable. Recently, two general methods for scaling up instance selection algorithms have been published: stratification and democratization. Both methods deal successfully with large datasets. In this paper we compare the two methods on very large and huge datasets of up to 50,000,000 instances. We also test their performance on huge datasets that are class-imbalanced. The comparison uses a parallel implementation of both methods to fully exploit their capabilities. Although both methods behave very well in terms of testing error, storage reduction, and execution time, democratization shows better overall performance.
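
To make the scaling argument concrete, the following is a minimal sketch of the general idea behind stratification as commonly described: split the data into disjoint strata, run an instance selection algorithm on each stratum independently, and take the union of the results. With s strata of roughly n/s instances each, a quadratic selector costs about s * (n/s)^2 = n^2 / s in total, and the strata can be processed in parallel. Everything named here is an assumption for illustration and is not taken from the paper: the function names `stratified_selection` and `condensed_nn`, the synthetic data, and the selector itself, which is a single-pass simplification of condensed nearest neighbor used only as a stand-in for a quadratic-cost algorithm. Democratization is not sketched.

```python
import random


def condensed_nn(points):
    """Toy stand-in selector (hypothetical, not the paper's algorithm).

    Single-pass simplification of condensed nearest neighbor: keep an
    instance only if its nearest already-kept instance has a different
    label. Worst-case cost is quadratic in the number of instances.
    """
    kept = []
    for x, label in points:
        if not kept:
            kept.append((x, label))
            continue
        nearest = min(
            kept,
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)),
        )
        if nearest[1] != label:
            kept.append((x, label))
    return kept


def stratified_selection(points, n_strata, selector, seed=0):
    """Shuffle, split into disjoint strata, select per stratum, merge."""
    shuffled = points[:]
    random.Random(seed).shuffle(shuffled)
    # Disjoint strata of near-equal size; each could run in parallel.
    strata = [shuffled[i::n_strata] for i in range(n_strata)]
    selected = []
    for stratum in strata:
        selected.extend(selector(stratum))
    return selected


# Synthetic two-class data (assumed for illustration only).
rng = random.Random(42)
data = []
for _ in range(200):
    x = (rng.random(), rng.random())
    data.append((x, int(x[0] + x[1] > 1.0)))

selected = stratified_selection(data, n_strata=4, selector=condensed_nn)
print(len(data), "->", len(selected), "instances kept")
```

Because each stratum is processed without reference to the others, the merged subset may differ from what the selector would return on the full dataset; that trade-off between cost and selection quality is the kind of behavior the comparison in the paper evaluates.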