Cost-Constrained Data Acquisition for Intelligent Data Preparation

Authors:
Xingquan Zhu;Xindong Wu
Affiliations:
IEEE;IEEE
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2005

Abstract

Real-world data is often noisy, corrupted, or incomplete, and these defects can degrade the models built from it. To build accurate predictive models, data acquisition is commonly used to prepare the data and fill in missing values. However, because acquisition is costly and the attributes in a data set are inherently correlated, acquiring correct information for every instance is both prohibitive and unnecessary. An important problem is therefore to select which instances to complete so that the model built from the processed data receives the "maximum" performance improvement. The problem is complicated by the fact that attributes carry different costs: fixing the missing values of some attributes is inherently more expensive than fixing others. The problem thus becomes: given a fixed budget, which instances should be selected for preparation so that the learner built from the processed data set maximizes its performance? In this paper, we propose a solution whose essential idea is to combine attribute costs with the relevance of each attribute to the target concept, so that data acquisition focuses on attributes that are cheap but informative for classification. To this end, we first introduce a unique Economical Factor (EF) that seamlessly integrates the cost and the importance (in terms of classification) of each attribute. We then propose a cost-constrained data acquisition model in which active learning, missing value prediction, and impact-sensitive instance ranking are combined for effective data acquisition. Experimental results and comparative studies on real-world data sets demonstrate the effectiveness of our method.
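
The idea of favoring attributes that are cheap but informative can be illustrated with a minimal Python sketch. The abstract does not give the definition of the Economical Factor, so the score below (information gain divided by attribute cost) is only a hypothetical stand-in and not the paper's EF. The function names, the toy data, and the greedy budget loop are likewise assumptions made for illustration; the sketch ranks missing cells by attribute score alone and leaves out the active learning, missing value prediction, and impact-sensitive instance ranking components.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Information gain of one attribute, using only rows where it is observed."""
    observed = [(v, y) for v, y in zip(values, labels) if v is not None]
    if not observed:
        return 0.0
    ys = [y for _, y in observed]
    groups = {}
    for v, y in observed:
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / len(ys) * entropy(g) for g in groups.values())
    return entropy(ys) - remainder


def cost_adjusted_scores(columns, labels, costs):
    """Hypothetical stand-in for EF (an assumption): information gain per unit cost."""
    return {a: information_gain(columns[a], labels) / costs[a] for a in columns}


def select_acquisitions(columns, labels, costs, budget):
    """Greedily pick missing cells (row, attribute) by score until the budget runs out."""
    scores = cost_adjusted_scores(columns, labels, costs)
    candidates = [
        (scores[a], a, i)
        for a, vals in columns.items()
        for i, v in enumerate(vals)
        if v is None
    ]
    # Highest score first; attribute name and row index break ties deterministically.
    candidates.sort(key=lambda t: (-t[0], t[1], t[2]))
    chosen, spent = [], 0.0
    for _, a, i in candidates:
        if spent + costs[a] <= budget:
            chosen.append((i, a))
            spent += costs[a]
    return chosen, spent


# Toy data (invented for illustration): None marks a missing value.
labels = ["y", "y", "n", "n", "y", "n"]
columns = {
    "cheap_informative": ["a", "a", "b", "b", None, None],
    "pricey_informative": ["a", "a", "b", None, None, "b"],
    "cheap_noise": ["a", "b", "a", "b", None, "a"],
}
costs = {"cheap_informative": 1, "pricey_informative": 5, "cheap_noise": 1}

scores = cost_adjusted_scores(columns, labels, costs)
chosen, spent = select_acquisitions(columns, labels, costs, budget=3)
print(scores)
print(chosen, spent)
```

In this toy setup the two informative attributes separate the classes equally well on their observed rows, so the cost term alone decides which one is acquired first. A greedy pass is only one simple way to respect a budget; it makes no claim to optimality.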