On the Dimensions of Data Complexity through Synthetic Data Sets

Authors:
Núria Macià;Ester Bernadó-Mansilla;Albert Orriols-Puig
Affiliations:
Grup de Recerca en Sistemes Intel·ligents, Enginyeria i Arquitectura La Salle-Universitat Ramon Llull, Quatre Camins 2, 08022, Barcelona (Spain), {nmacia,esterb,aorriols}@salle.url.edu
Venue:
Proceedings of the 2008 conference on Artificial Intelligence Research and Development: Proceedings of the 11th International Conference of the Catalan Association for Artificial Intelligence
Year:
2008

Abstract

This paper addresses the characterization of data complexity and its relationship with classification accuracy. We study three dimensions of data complexity: the length of the class boundary, the number of features, and the number of instances in the data set. We find that the length of the class boundary is the most relevant dimension of complexity, since it can be used to estimate the maximum accuracy a classifier can achieve. The number of attributes and the number of instances do not, by themselves, affect classifier accuracy when the boundary length is held constant. The study emphasizes measures that reveal the intrinsic structure of the data and recommends using them to draw conclusions about classifier behavior and relative classifier performance in multiple-comparison experiments.
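
The abstract does not say how the length of the class boundary is computed. As an illustration only, the sketch below assumes one commonly used definition (the minimum-spanning-tree measure usually called N1 in the data complexity literature): build a minimum spanning tree over all instances and report the fraction of instances that touch an edge joining two different classes. The function name boundary_fraction and the use of Euclidean distance are assumptions made for this example, not details taken from the paper.

```python
import math


def boundary_fraction(points, labels):
    """Fraction of instances incident to an MST edge that joins two classes.

    Hypothetical helper: assumes Euclidean distance and the MST-based
    boundary measure; the paper's exact definition may differ.
    """
    n = len(points)
    if n < 2:
        return 0.0

    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known edge into the tree
    best_from = [-1] * n         # tree endpoint of that cheapest edge
    best_dist[0] = 0.0
    on_boundary = set()

    # Prim's algorithm on the complete graph, O(n^2) time.
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best_dist[i])
        in_tree[u] = True
        v = best_from[u]
        # The edge (v, u) has just joined the tree; record it if it crosses classes.
        if v >= 0 and labels[u] != labels[v]:
            on_boundary.update((u, v))
        for w in range(n):
            if not in_tree[w]:
                d = math.dist(points[u], points[w])
                if d < best_dist[w]:
                    best_dist[w] = d
                    best_from[w] = u

    return len(on_boundary) / n


if __name__ == "__main__":
    # Two well-separated clusters: only one tree edge crosses the classes.
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(boundary_fraction(pts, [0, 0, 0, 1, 1, 1]))
    # Alternating classes along a line: every tree edge crosses the classes.
    print(boundary_fraction([(0,), (1,), (2,), (3,)], [0, 1, 0, 1]))
```

Under this assumed definition, a value near 0 suggests well-separated classes and a value near 1 suggests heavily interleaved classes. The measure depends on how instances of different classes are arranged relative to one another rather than directly on the number of features or instances, which is in keeping with the abstract's claim that those two dimensions do not affect accuracy by themselves.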