Classification is a well-studied problem in machine learning and data mining. Classifier performance was originally gauged almost exclusively by predictive accuracy. As work in the field progressed, more sophisticated measures of classifier utility were introduced that better represent the value of the induced knowledge. Nonetheless, most work still ignored the cost of acquiring training examples, even though this cost affects the overall utility of a classifier. In this paper we consider the cost of acquiring training examples in the data mining process: we analyze the impact of training data cost on learning, identify the optimal training set size for a given data set, and analyze the performance of several progressive sampling schemes that, given the cost of the training data, generate classifiers that come close to maximizing overall utility.