Cost-Based Sampling of Individual Instances

Authors:
William Klement;Peter Flach;Nathalie Japkowicz;Stan Matwin
Affiliations:
School of Information Technology and Engineering, University of Ottawa, Ottawa, Canada K1N 6N5;Department of Computer Science, University of Bristol, Bristol, United Kingdom BS8 1UB;School of Information Technology and Engineering, University of Ottawa, Ottawa, Canada K1N 6N5;School of Information Technology and Engineering, University of Ottawa, Ottawa, Canada K1N 6N5 and Institute of Computer Science, Polish Academy of Sciences, Poland
Venue:
Canadian AI '09 Proceedings of the 22nd Canadian Conference on Artificial Intelligence: Advances in Artificial Intelligence
Year:
2009

Citing 14
Cited 1

Machine Learning for the Detection of Oil Spills in Satellite Radar Images

Machine Learning - Special issue on applications of machine learning and the knowledge discovery process
MetaCost: a general method for making classifiers cost-sensitive

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Learning and making decisions when costs and probabilities are both unknown

Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
Adaptive Fraud Detection

Data Mining and Knowledge Discovery
Class Probability Estimation and Cost-Sensitive Classification Decisions

ECML '02 Proceedings of the 13th European Conference on Machine Learning
Improving Minority Class Prediction Using Case-Specific Feature Weights

ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
The Case against Accuracy Estimation for Comparing Induction Algorithms

ICML '98 Proceedings of the Fifteenth International Conference on Machine Learning
AdaCost: Misclassification Cost-Sensitive Boosting

ICML '99 Proceedings of the Sixteenth International Conference on Machine Learning
Exploiting the Cost (In)sensitivity of Decision Tree Splitting Criteria

ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning
Cost-Sensitive Learning by Cost-Proportionate Example Weighting

ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems)

Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems)
AUC: a statistically consistent and more discriminating measure than accuracy

IJCAI'03 Proceedings of the 18th international joint conference on Artificial intelligence
The foundations of cost-sensitive learning

IJCAI'01 Proceedings of the 17th international joint conference on Artificial intelligence - Volume 2
Severe class imbalance: why better algorithms aren't the answer

ECML'05 Proceedings of the 16th European conference on Machine Learning

Classifying severely imbalanced data

Canadian AI'11 Proceedings of the 24th Canadian conference on Advances in artificial intelligence

Quantified Score

Hi-index	0.00

Visualization

Abstract

In many practical domains, misclassification costs can differ greatly and may be represented by class ratios, however, most learning algorithms struggle with skewed class distributions. The difficulty is attributed to designing classifiers to maximize the accuracy. Researchers call for using several techniques to address this problem including; under-sampling the majority class, employing a probabilistic algorithm, and adjusting the classification threshold. In this paper, we propose a general sampling approach that assigns weights to individual instances according to the cost function. This approach helps reveal the relationship between classification performance and class ratios and allows the identification of an appropriate class distribution for which, the learning method achieves a reasonable performance on the data. Our results show that combining an ensemble of Naive Bayes classifiers with threshold selection and under-sampling techniques works well for imbalanced data.