Overall performance of the data mining process depends not just on the value of the induced knowledge but also on the various costs of the process itself, such as the cost of acquiring and pre-processing training examples, the CPU cost of model induction, and the cost of errors committed by the induced model. Recently, several progressive sampling strategies for maximizing overall data mining utility have been proposed. All of these strategies are based on repeatedly acquiring additional training examples until a decrease in utility is observed. In this paper, we present an alternative, projective sampling strategy, which fits functions to a partial learning curve and a partial run-time curve obtained from a small subset of the potentially available data, and then uses these projected functions to analytically estimate the optimal training set size. The proposed approach is evaluated on a variety of benchmark datasets using the RapidMiner environment for machine learning and data mining processes. The results show that learning and run-time curves projected from only a few data points can lead to a cheaper data mining process than the common progressive sampling methods.
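The projective sampling idea described above can be illustrated with a small sketch: fit a power-law learning curve and a linear run-time curve to measurements taken on a few small samples, then pick the training set size that maximizes a utility function trading off accuracy benefit against acquisition and CPU costs. The curve forms, cost weights, and measurement values below are illustrative assumptions, not the paper's actual experimental settings.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical measurements from a few small training samples.
sizes = np.array([100, 200, 400, 800, 1600], dtype=float)
accs = np.array([0.70, 0.76, 0.81, 0.84, 0.86])   # partial learning curve
times = np.array([0.05, 0.11, 0.22, 0.45, 0.93])  # partial run-time curve (s)

# Assumed power-law learning curve: acc(n) = a - b * n^(-c).
def power_law(n, a, b, c):
    return a - b * n ** (-c)

# Fit the learning curve and a linear run-time curve t(n) = d*n + e.
(pa, pb, pc), _ = curve_fit(power_law, sizes, accs, p0=[0.9, 1.0, 0.5],
                            maxfev=10000)
td, te = np.polyfit(sizes, times, 1)

# Illustrative cost weights: benefit per unit accuracy, cost per acquired
# example, and cost per CPU second of induction time.
BENEFIT, COST_PER_EX, COST_PER_SEC = 10000.0, 0.05, 1.0

def utility(n):
    """Projected net utility of training on n examples."""
    return (BENEFIT * power_law(n, pa, pb, pc)
            - COST_PER_EX * n
            - COST_PER_SEC * (td * n + te))

# Estimate the optimal training set size from the projected curves alone,
# without acquiring any further data.
candidates = np.arange(100, 50001, 100, dtype=float)
n_star = candidates[np.argmax(utility(candidates))]
```

Because the optimum is read off the fitted curves rather than discovered by repeatedly buying more data, the only acquisition cost paid up front is for the few small samples used to fit the curves.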