A divide-and-conquer recursive approach for scaling up instance selection algorithms

Authors:
Aida Haro-García;Nicolás García-Pedrajas
Affiliations:
Department of Computing and Numerical Analysis, University of Córdoba, Córdoba, Spain 14071;Department of Computing and Numerical Analysis, University of Córdoba, Córdoba, Spain 14071
Venue:
Data Mining and Knowledge Discovery
Year:
2009



Abstract

Instance selection is becoming increasingly relevant due to the huge amount of data that is constantly being produced. However, although current algorithms are useful for fairly large datasets, they run into scaling problems when the number of instances reaches hundreds of thousands or millions. In the best case, these algorithms have complexity O(n^2), n being the number of instances. For huge problems, scalability becomes an issue and most algorithms are not applicable. This paper presents a divide-and-conquer recursive approach to instance selection for instance-based learning on very large problems. Our method divides the original training set into small subsets, and the instance selection algorithm is applied to each of them. The selected instances are then rejoined into a new training set, and the same procedure, partitioning followed by application of an instance selection algorithm, is repeated. In this way, our approach applies the divide-and-conquer philosophy in a recursive manner. The proposed method is able to match, and in the case of storage reduction even improve, the results of well-known standard algorithms, with a very significant reduction in execution time. An extensive comparison on 30 datasets from the UCI Machine Learning Repository shows the usefulness of our method. Additionally, the method is applied to 5 huge datasets, ranging from 300,000 to more than a million instances, with very good results and fast execution times.
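
The partition, select, rejoin, and repeat procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `recursive_select`, `condensed_nn`, `subset_size`, and `max_depth` are hypothetical, Hart's condensed nearest neighbor rule stands in for whatever base instance selection algorithm is plugged in, the subsets are formed by random shuffling, and the stopping rule (stop when nothing more is removed, when the result fits in one subset, or at a depth limit) is assumed, since the abstract does not state one.

```python
import random
from typing import Callable, List, Tuple

Instance = Tuple[Tuple[float, ...], int]  # (feature vector, class label)
Selector = Callable[[List[Instance]], List[Instance]]


def sq_dist(a, b) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def condensed_nn(subset: List[Instance]) -> List[Instance]:
    """Stand-in base selector (assumption): Hart's condensed nearest neighbor.

    Keeps an instance only if the instances kept so far misclassify it
    with the 1-NN rule. Cost is quadratic in len(subset).
    """
    if not subset:
        return []
    kept = [0]          # indices of retained instances, in insertion order
    kept_set = {0}
    changed = True
    while changed:      # terminates: kept grows by at least one per pass
        changed = False
        for i, (x, y) in enumerate(subset):
            if i in kept_set:
                continue
            nearest = min(kept, key=lambda j: sq_dist(x, subset[j][0]))
            if subset[nearest][1] != y:
                kept.append(i)
                kept_set.add(i)
                changed = True
    return [subset[i] for i in kept]


def recursive_select(data: List[Instance], select: Selector,
                     subset_size: int, rng: random.Random,
                     max_depth: int = 10) -> List[Instance]:
    """Divide into small subsets, select within each, rejoin, and recurse."""
    pool = list(data)
    rng.shuffle(pool)   # random partitioning is an assumption of this sketch
    parts = [pool[i:i + subset_size] for i in range(0, len(pool), subset_size)]
    selected = [inst for part in parts for inst in select(part)]
    # Assumed stopping rule: no further reduction, small enough, or depth limit.
    no_progress = len(selected) == len(pool)
    if max_depth <= 1 or no_progress or len(selected) <= subset_size:
        return selected
    return recursive_select(selected, select, subset_size, rng, max_depth - 1)
```

A call such as `recursive_select(train, condensed_nn, 50, random.Random(0))` would return the reduced training set. If the base selector is quadratic, one round over n/s subsets of size s costs on the order of (n/s)·s^2 = n·s operations rather than n^2, which is one way to read the reduction in execution time the abstract reports.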