Aggregative quantification for regression

  • Authors:
  • Antonio Bella; Cèsar Ferri; José Hernández-Orallo; María José Ramírez-Quintana

  • Affiliations:
  • DSIC-ELP, Universitat Politècnica de València, Valencia, Spain 46022 (all authors)

  • Venue:
  • Data Mining and Knowledge Discovery
  • Year:
  • 2014

Abstract

The problem of estimating the class distribution (or prevalence) for a new unlabelled dataset (from a possibly different distribution) is a common problem that has been addressed in one way or another over the past decades. This problem has recently been reconsidered as a new task in data mining, renamed quantification when the estimation is performed as an aggregation (and possible adjustment) of a single-instance supervised model (e.g., a classifier). However, the study of quantification has been limited to classification, while it is clear that this problem also appears, perhaps even more frequently, with other predictive problems, such as regression. In this case, the goal is to determine a distribution or an aggregated indicator of the output variable for a new unlabelled dataset. In this paper, we introduce a comprehensive new taxonomy of quantification tasks, distinguishing between the estimation of the whole distribution and the estimation of some indicators (summary statistics), for both classification and regression. This distinction is especially useful for regression, since predictions are numerical values that can be aggregated in many different ways, as in multi-dimensional hierarchical data warehouses. We focus on aggregative quantification for regression and show that the approaches borrowed from classification do not work. We present several techniques based on segmentation which are able to produce accurate estimations of the expected value and the distribution of the output variable. We show experimentally that these methods excel especially in the relevant scenarios where training and test distributions differ dramatically.
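
The sketch below is only an illustration of the aggregative setting described in the abstract, not the segmentation-based methods proposed in the paper: it shows the naive baseline that trains a single-instance regressor on labelled data and then aggregates its per-instance predictions to estimate the expected value of the output variable on an unlabelled deployment set. The regressor, the synthetic data and the input-distribution shift are hypothetical placeholders chosen for the example.

```python
# Illustrative sketch only: the naive "predict and aggregate" baseline for
# regression quantification (NOT the paper's segmentation-based techniques).
# The data-generating process and the distribution shift are made up here.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Labelled training data drawn from one input distribution.
X_train = rng.uniform(0, 10, size=(1000, 1))
y_train = 2.0 * X_train[:, 0] + rng.normal(0, 1, size=1000)

# Unlabelled deployment data drawn from a shifted input distribution.
X_test = rng.uniform(5, 10, size=(500, 1))
y_test = 2.0 * X_test[:, 0] + rng.normal(0, 1, size=500)  # hidden; used only to check

model = DecisionTreeRegressor(max_depth=5).fit(X_train, y_train)

# Aggregative quantification baseline: predict every instance, then aggregate.
estimated_mean = model.predict(X_test).mean()
print(f"Estimated mean of the output variable: {estimated_mean:.2f}")
print(f"True (hidden) mean of the output:      {y_test.mean():.2f}")
```

Under the dramatic training/test shifts highlighted in the abstract, this kind of naive aggregation inherits any systematic bias of the single-instance model, which is precisely the situation the paper's segmentation-based techniques are designed to handle.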