Editorial: Large scale instance selection by means of federal instance selection

Authors:
Aida de Haro-García;Nicolás García-Pedrajas;Juan Antonio Romero del Castillo
Venue:
Data & Knowledge Engineering
Year:
2012


Abstract

Instance selection is becoming increasingly relevant because of the huge amount of data that is constantly being produced. Although current algorithms are useful for fairly large datasets, they run into scaling problems when the number of instances reaches hundreds of thousands or millions. Most widely used instance selection algorithms have a complexity of at least O(n^2), where n is the number of instances, so for very large problems scalability becomes an issue and most of these algorithms are not applicable. This paper presents a methodology for scaling up instance selection algorithms by means of a parallel procedure that performs instance selection on small subsets of the original dataset. The results obtained on these small subsets are combined using a voting scheme. The method achieves very good performance in terms of testing error and storage reduction, while decreasing the execution time of the process very significantly. The parallel algorithm also removes any constraint imposed by memory size, as the whole dataset does not need to be stored in memory. The usefulness of our method is shown by an extensive comparison using 35 medium and large datasets from the UCI Machine Learning Repository. Additionally, our method is applied to eight very large datasets, with very good results and fast execution times.
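
The general idea described in the abstract (select instances on small subsets, then combine the results by voting) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which base algorithm, how many partitioning rounds, or which voting threshold is used. The sketch assumes a Hart-style condensed nearest neighbor rule as the base selector, random disjoint partitions, and a simple "fraction of rounds kept" threshold. The names `condense_1nn`, `voted_instance_selection`, `subset_size`, `rounds` and `keep_threshold` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def condense_1nn(X, y, idx):
    """Assumed base selector: Hart-style condensed nearest neighbor.

    Works only on the instances listed in `idx` and returns the set of
    indices it keeps. Cost is quadratic in len(idx), not in len(X).
    """
    kept = [idx[0]]
    kept_set = {idx[0]}
    changed = True
    while changed:
        changed = False
        for i in idx:
            if i in kept_set:
                continue
            nearest = min(kept, key=lambda j: sq_dist(X[i], X[j]))
            if y[nearest] != y[i]:  # misclassified by the current store
                kept.append(i)
                kept_set.add(i)
                changed = True
    return kept_set


def voted_instance_selection(X, y, subset_size=50, rounds=5,
                             keep_threshold=0.5, seed=0):
    """Run the base selector on small disjoint subsets and combine by voting.

    Each round shuffles the data into disjoint blocks of `subset_size`.
    An instance gets one vote per round in which its block's selector keeps
    it. Instances kept in at least `keep_threshold` of the rounds survive.
    The threshold rule is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    n = len(X)
    votes = [0] * n
    for _ in range(rounds):
        order = list(range(n))
        rng.shuffle(order)
        # Blocks are independent of each other, so this loop is the part
        # that could be distributed; here it runs sequentially.
        for start in range(0, n, subset_size):
            block = order[start:start + subset_size]
            for i in condense_1nn(X, y, block):
                votes[i] += 1
    return [i for i in range(n) if votes[i] / rounds >= keep_threshold]
```

With a fixed `subset_size` of s, each round runs about n/s selectors that each cost O(s^2), so the work per round is O(n·s), which grows linearly with n instead of quadratically. Because every block is processed on its own, only one block has to be held in memory at a time, which matches the abstract's remark about memory constraints.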