Estimating the class distribution (or prevalence) of a new unlabelled dataset, possibly drawn from a different distribution, is a long-standing problem that has been addressed in one way or another over the past decades. It has recently been reconsidered as a data mining task in its own right, renamed quantification when the estimate is obtained by aggregating (and possibly adjusting) the outputs of a single-instance supervised model, such as a classifier. However, the study of quantification has been limited to classification, even though the problem also arises, perhaps even more frequently, in other predictive settings such as regression, where the goal is to determine a distribution or an aggregated indicator of the output variable for a new unlabelled dataset. In this paper, we introduce a comprehensive new taxonomy of quantification tasks, distinguishing between estimating the whole distribution and estimating summary indicators (statistics), for both classification and regression. This distinction is especially useful for regression, since numerical predictions can be aggregated in many different ways, as in multi-dimensional hierarchical data warehouses. We focus on aggregative quantification for regression and show that approaches borrowed from classification do not work. We present several segmentation-based techniques that produce accurate estimates of the expected value and the distribution of the output variable, and we show experimentally that these methods especially excel in the relevant scenarios where training and test distributions differ dramatically.
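To make the aggregative quantification idea concrete for the classification case the abstract builds on, here is a minimal sketch of classify-and-count and its classic adjusted-count correction, which rescales the raw count using the classifier's error rates. The toy predictions and the rates `tpr`/`fpr` below are illustrative assumptions, not values or methods from this paper.

```python
def classify_and_count(predictions):
    """Naive quantification: prevalence = fraction predicted positive."""
    return sum(predictions) / len(predictions)

def adjusted_count(predictions, tpr, fpr):
    """Correct classify-and-count for classifier error using the
    true/false positive rates (typically estimated on training data):
        p = (cc - fpr) / (tpr - fpr)
    The result is clipped to the valid [0, 1] range."""
    cc = classify_and_count(predictions)
    p = (cc - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, p))

# Toy test-set predictions: 1 = predicted positive, 0 = predicted negative.
preds = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
cc = classify_and_count(preds)                 # 0.4
ac = adjusted_count(preds, tpr=0.8, fpr=0.1)   # (0.4 - 0.1) / (0.8 - 0.1)
```

The adjustment is what makes the estimator robust to class-distribution shift; the paper's point is that this kind of correction does not carry over directly to regression outputs, which motivates its segmentation-based techniques.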