Maximizing classifier utility when training data is costly

Authors:
Gary M. Weiss; Ye Tian
Affiliations:
Fordham University, Bronx, NY; Fordham University, Bronx, NY
Venue:
ACM SIGKDD Explorations Newsletter
Year:
2006

Abstract

Classification is a well-studied problem in machine learning and data mining. Classifier performance was originally gauged almost exclusively by predictive accuracy. As work in the field progressed, more sophisticated measures of classifier utility were introduced that better represent the value of the induced knowledge. Nonetheless, most work still ignored the cost of acquiring training examples, even though this cost affects the overall utility of a classifier. In this paper we consider the cost of acquiring training examples in the data mining process. We analyze the impact of the cost of training data on learning, identify the optimal training set size for a given data set, and analyze the performance of several progressive sampling schemes which, given the cost of the training data, generate classifiers that come close to maximizing overall utility.
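The trade-off described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's utility definition, so everything here is an assumption made for illustration: total cost is taken to be additive (a per-example acquisition cost plus a per-error cost over the examples to be classified), the sampling schedule is a simple geometric one, and the names `total_cost`, `geometric_progressive_sampling`, and `toy_error_curve` are hypothetical. The learning curve is a synthetic stand-in for training and evaluating a real classifier.

```python
from typing import Callable, List, Tuple


def total_cost(n_train: int, error_rate: float, cost_per_example: float,
               n_scored: int, cost_per_error: float) -> float:
    """Assumed additive cost: data acquisition plus misclassification cost."""
    return n_train * cost_per_example + error_rate * n_scored * cost_per_error


def geometric_progressive_sampling(
    error_at: Callable[[int], float],  # error rate of a classifier trained on n examples
    n_available: int,
    cost_per_example: float,
    n_scored: int,
    cost_per_error: float,
    n_start: int = 50,
    factor: float = 2.0,
) -> Tuple[int, float, List[Tuple[int, float]]]:
    """Grow the training set geometrically; stop once total cost stops falling."""
    history: List[Tuple[int, float]] = []
    best_n, best_cost = 0, float("inf")
    n = min(n_start, n_available)
    while True:
        cost = total_cost(n, error_at(n), cost_per_example, n_scored, cost_per_error)
        history.append((n, cost))
        if cost < best_cost:
            best_n, best_cost = n, cost
        else:
            break  # total cost rose (or stalled), so stop acquiring data
        if n >= n_available:
            break  # no more data to acquire
        # Always grow by at least one example so the loop terminates.
        n = min(max(n + 1, int(n * factor)), n_available)
    return best_n, best_cost, history


def toy_error_curve(n: int) -> float:
    """Hypothetical learning curve: error falls toward 0.10 as n grows."""
    return 0.10 + 0.40 / (n ** 0.5)


best_n, best_cost, history = geometric_progressive_sampling(
    toy_error_curve, n_available=10_000,
    cost_per_example=1.0, n_scored=10_000, cost_per_error=1.0,
)
```

Under these assumed costs the schedule visits sizes 50, 100, 200, and 400 and stops at 400, where total cost first rises. Note that the examples acquired in the final step have already been paid for, so the cost actually incurred is the last entry in `history` rather than `best_cost`; this overshoot is one reason the choice of sampling schedule matters.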