Towards UCI+: A mindful repository design

Authors:
Núria Macià; Ester Bernadó-Mansilla
Venue:
Information Sciences: an International Journal
Year:
2014

Abstract

Public repositories have contributed to the maturation of experimental methodology in machine learning. Publicly available data sets have allowed researchers to assess their learners empirically and, together with open-source machine learning software, have favoured the emergence of comparative analyses of learners' performance over a common framework. These studies have established standard procedures for evaluating machine learning techniques. However, current claims, such as the superiority of enhanced algorithms, are biased by unsupported assumptions made in common practice. This paper inspects the early steps of the methodology, namely data set selection. In particular, it examines how the most popular data repository in machine learning, the UCI repository, is exploited. We analyse the type, complexity, and use of UCI data sets. The study recommends the design of a mindful data repository, UCI+, which should include a set of properly characterised data sets consisting of a complete and representative sample of real-world problems, enriched with artificial benchmarks. The ultimate goal of UCI+ is to lay the foundations for a well-supported methodology for learner assessment.