Large scale instance selection by means of a parallel algorithm

Authors:
Aida de Haro-García;Juan Antonio Romero del Castillo;Nicolás García-Pedrajas
Affiliations:
Department of Computing and Numerical Analysis of the University of Córdoba, Córdoba, Spain;Department of Computing and Numerical Analysis of the University of Córdoba, Córdoba, Spain;Department of Computing and Numerical Analysis of the University of Córdoba, Córdoba, Spain
Venue:
IDEAL'10 Proceedings of the 11th international conference on Intelligent data engineering and automated learning
Year:
2010

Abstract

Instance selection is becoming increasingly relevant due to the huge amount of data that is constantly being produced. However, although current algorithms are useful for fairly large datasets, many scaling problems arise when the number of instances reaches hundreds of thousands or millions. Most instance selection algorithms have a complexity of at least O(n²), n being the number of instances. For huge problems, scalability becomes an issue, and most of the algorithms are not applicable. This paper presents a way of removing this difficulty by means of a parallel algorithm that performs several rounds of instance selection on subsets of the original dataset. These rounds are combined using a voting scheme that achieves very good performance in terms of testing error and storage reduction, while the execution time of the process decreases very significantly. The method is especially efficient when the instance selection algorithm used has a high computational cost. An extensive comparison on 35 datasets of medium and large sizes from the UCI Machine Learning Repository shows the usefulness of our method. Additionally, the method is applied to 6 huge datasets (from three hundred thousand to more than four million instances) with very good results and fast execution time.
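
The abstract describes the general scheme (several rounds of instance selection on subsets, combined by voting) without giving details, so the following Python sketch is only an illustration under stated assumptions, not the authors' implementation. The names `condensed_nn` and `voted_selection`, the random disjoint partitioning, the rule "keep an instance if it was retained in at least `min_votes` rounds", and all default parameter values are hypothetical choices made for this example. The base selector is a single-pass simplification of Hart's condensed nearest neighbour, used here only as a stand-in for any costly instance selection algorithm.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def condensed_nn(X, y, idx):
    """Stand-in base selector: one pass of condensed nearest neighbour.

    Returns the indices from `idx` that are kept as prototypes.
    """
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    kept = [idx[0]]
    for i in idx[1:]:
        nearest = min(kept, key=lambda k: dist2(X[i], X[k]))
        if y[nearest] != y[i]:  # misclassified by current prototypes: keep it
            kept.append(i)
    return kept


def voted_selection(X, y, select=condensed_nn, rounds=5, subset_size=100,
                    min_votes=3, workers=4, seed=0):
    """Run `rounds` rounds of selection on random disjoint subsets and vote.

    Assumption: an instance is retained in the final result if the base
    selector kept it in at least `min_votes` rounds.
    """
    rng = random.Random(seed)
    n = len(X)
    votes = Counter()
    # Threads keep the sketch simple; in CPython, real speed-ups for
    # CPU-bound selectors would need processes or separate machines.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(rounds):
            order = list(range(n))
            rng.shuffle(order)  # a different partition in every round
            subsets = [order[s:s + subset_size]
                       for s in range(0, n, subset_size)]
            # Subsets are independent, so they can be processed concurrently.
            for kept in pool.map(lambda sub: select(X, y, sub), subsets):
                votes.update(kept)
    return sorted(i for i in range(n) if votes[i] >= min_votes)
```

Under these assumptions, if the base selector costs O(s²) on a subset of size s, one round over n/s subsets costs O(n·s), which is linear in n for a fixed subset size; this is the kind of saving that makes the approach attractive when the base algorithm is expensive.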