On the Dimensions of Data Complexity through Synthetic Data Sets

  • Authors:
  • Núria Macià;Ester Bernadó-Mansilla;Albert Orriols-Puig

  • Affiliations:
  • Grup de Recerca en Sistemes Intel·ligents, Enginyeria i Arquitectura La Salle-Universitat Ramon Llull, Quatre Camins 2, 08022, Barcelona (Spain), {nmacia,esterb,aorriols}@salle.url.edu

  • Venue:
  • Proceedings of the 2008 conference on Artificial Intelligence Research and Development: Proceedings of the 11th International Conference of the Catalan Association for Artificial Intelligence
  • Year:
  • 2008

Abstract

This paper studies the characterization of data complexity and its relationship with classification accuracy. We examine three dimensions of data complexity: the length of the class boundary, the number of features, and the number of instances in the data set. We find that the length of the class boundary is the most relevant dimension, since it can be used to estimate the maximum accuracy a classifier can achieve. The number of features and the number of instances do not by themselves affect classifier accuracy when the boundary length is held constant. The study emphasizes measures that reveal the intrinsic structure of the data and recommends using them to draw conclusions about classifier behavior and relative performance in multiple-comparison experiments.
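
The abstract does not say how the length of the class boundary is measured. As an illustration only, the sketch below computes one proxy that is common in the data complexity literature: the fraction of points joined to a point of another class by an edge of the minimum spanning tree (often labeled N1). The function name `boundary_fraction`, the use of Euclidean distance, and the toy data are assumptions made for this example and are not taken from the paper.

```python
import math


def boundary_fraction(points, labels):
    """Fraction of points that share an MST edge with a point of another class.

    Assumption: Euclidean distance over numeric feature tuples. The MST is
    built with a simple O(n^2) version of Prim's algorithm.
    """
    n = len(points)
    if n < 2:
        return 0.0

    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known link from each point to the tree
    best_from = [-1] * n         # tree endpoint of that cheapest link
    best_dist[0] = 0.0           # start the tree at the first point
    on_boundary = set()

    for _ in range(n):
        # Add the point outside the tree that is closest to the tree.
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best_dist[i])
        in_tree[u] = True

        # The edge (best_from[u], u) is an MST edge; flag both ends
        # if it connects two different classes.
        v = best_from[u]
        if v >= 0 and labels[u] != labels[v]:
            on_boundary.update((u, v))

        # Update the cheapest links of the remaining points.
        for w in range(n):
            if not in_tree[w]:
                d = math.dist(points[u], points[w])
                if d < best_dist[w]:
                    best_dist[w] = d
                    best_from[w] = u

    return len(on_boundary) / n


# Toy one-dimensional data (hypothetical): two well-separated classes
# versus two fully interleaved classes with the same number of instances.
separated = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,), (13.0,)]
separated_labels = [0, 0, 0, 0, 1, 1, 1, 1]

interleaved = [(float(x),) for x in range(8)]
interleaved_labels = [0, 1, 0, 1, 0, 1, 0, 1]

print(boundary_fraction(separated, separated_labels))      # 0.25
print(boundary_fraction(interleaved, interleaved_labels))  # 1.0
```

Both toy sets have eight instances and one feature, yet the proxy differs sharply between them. This mirrors the abstract's point that the number of instances and the number of features say little about difficulty unless the boundary is also taken into account.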