Maximizing classifier utility when there are data acquisition and modeling costs

  • Authors:
  • Gary M. Weiss; Ye Tian

  • Affiliations:
  • Department of Computer and Information Science, Fordham University, Bronx, NY, USA 10458 (both authors)

  • Venue:
  • Data Mining and Knowledge Discovery
  • Year:
  • 2008

Abstract

Classification is a well-studied problem in data mining. Classification performance was originally gauged almost exclusively using predictive accuracy, but as work in the field progressed, more sophisticated measures of classifier utility that better represented the value of the induced knowledge were introduced. Nonetheless, most work still ignored the cost of acquiring training examples, even though this cost impacts the total utility of the data mining process. In this article we analyze the relationship between the number of acquired training examples and the utility of the data mining process and, given the necessary cost information, we determine the number of training examples that yields the optimum overall performance. We then extend this analysis to include the cost of model induction, measured as the CPU time required to generate the model. While our cost model does not take into account all possible costs, our analysis provides some useful insights and a template for future analyses using more sophisticated cost models. Because our analysis is based on experiments that acquire the full set of training examples, it cannot directly be used to find a classifier with optimal or near-optimal total utility. To address this issue we introduce two progressive sampling strategies that are empirically shown to produce classifiers with near-optimal total utility.
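The core idea — total utility as the benefit of the induced model minus the cost of acquiring its training examples and the CPU cost of inducing it — can be sketched in a few lines. The sketch below is illustrative only: the learning curve, induction-time model, and all cost/benefit parameters are hypothetical stand-ins, not taken from the paper, and the sampling loop is a generic geometric progressive-sampling scheme rather than either of the two strategies the authors actually propose.

```python
import math

# Hypothetical cost/benefit parameters (illustrative, not from the paper).
BENEFIT_PER_POINT = 50.0   # value of one percentage point of accuracy
COST_PER_EXAMPLE = 0.05    # acquisition cost of a single training example
CPU_COST_PER_SEC = 0.01    # cost of one second of model-induction time

def accuracy(n):
    """Stand-in learning curve: accuracy rises with diminishing returns."""
    return 100.0 * (1.0 - 0.5 * math.exp(-n / 2000.0))

def induction_time(n):
    """Stand-in CPU model: induction time grows linearly with n."""
    return 0.001 * n

def total_utility(n):
    """Benefit of the induced model minus acquisition and modeling costs."""
    return (BENEFIT_PER_POINT * accuracy(n)
            - COST_PER_EXAMPLE * n
            - CPU_COST_PER_SEC * induction_time(n))

def progressive_sample(start=100, factor=2, max_n=1_000_000):
    """Geometric progressive sampling: keep growing the training set
    while the additional examples still raise total utility."""
    n = start
    best_n, best_u = n, total_utility(n)
    while n * factor <= max_n:
        n *= factor
        u = total_utility(n)
        if u <= best_u:          # marginal utility turned negative: stop
            break
        best_n, best_u = n, u
    return best_n, best_u
```

Because acquisition cost grows linearly while accuracy saturates, `total_utility` peaks at a finite training-set size; the loop stops one doubling after that peak, which is why geometric sampling only approximates, rather than exactly finds, the optimum.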