Learner excellence biased by data set selection: A case for data characterisation and artificial data sets

Authors:
Núria Macià;Ester Bernadó-Mansilla;Albert Orriols-Puig;Tin Kam Ho
Affiliations:
Grup de Recerca en Sistemes Intelligents, La Salle - Universitat Ramon Llull, C/ Quatre Camins, 2, 08022 Barcelona, Spain;Grup de Recerca en Sistemes Intelligents, La Salle - Universitat Ramon Llull, C/ Quatre Camins, 2, 08022 Barcelona, Spain;Grup de Recerca en Sistemes Intelligents, La Salle - Universitat Ramon Llull, C/ Quatre Camins, 2, 08022 Barcelona, Spain;Bell Laboratories, Alcatel-Lucent, 600 Mountain Ave., Murray Hill, NJ 07974-0636, USA
Venue:
Pattern Recognition
Year:
2013

Abstract

The excellence of a given learner is usually claimed through a performance comparison with other learners over a collection of data sets. Too often, researchers are unaware of how their choice of data sets affects the results: their test beds are small, and the selection is not supported by any prior analysis of the data. Conclusions drawn from such test beds cannot be generalised, because particular data characteristics may favour certain learners without this being noticed. This work raises these issues and proposes characterising data sets with complexity measures, which can help both to guide experimental design and to explain the behaviour of learners.
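
The abstract does not name specific complexity measures, so the following is only an illustrative sketch. It assumes one measure that is widely used in the data complexity literature, the maximum Fisher's discriminant ratio, which for each feature divides the squared gap between the two class means by the sum of the class variances and keeps the largest value. The function name and the restriction to two classes are choices made for this example, not details taken from the paper.

```python
from statistics import mean, pvariance


def fisher_discriminant_ratio(X, y):
    """Maximum Fisher's discriminant ratio for a two-class data set.

    X: list of rows, each a list of numeric feature values.
    y: list of class labels with exactly two distinct values.
    Returns the maximum over features of
    (mean_a - mean_b)^2 / (var_a + var_b).
    """
    labels = sorted(set(y))
    if len(labels) != 2:
        raise ValueError("this sketch handles exactly two classes")
    a, b = labels

    best = 0.0
    for j in range(len(X[0])):
        # Split feature j by class.
        col_a = [row[j] for row, lab in zip(X, y) if lab == a]
        col_b = [row[j] for row, lab in zip(X, y) if lab == b]

        gap = (mean(col_a) - mean(col_b)) ** 2
        spread = pvariance(col_a) + pvariance(col_b)

        if spread == 0:
            # Constant within both classes: perfectly separating if the
            # means differ, uninformative otherwise.
            ratio = float("inf") if gap > 0 else 0.0
        else:
            ratio = gap / spread
        best = max(best, ratio)
    return best
```

A high value suggests that at least one feature separates the classes well, while a value near zero suggests heavy overlap on every individual feature. Computing such a measure for each data set in a test bed before comparing learners is one way to check whether the collection covers a range of difficulty rather than a narrow slice of it.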