A general approach to incorporate data quality matrices into data mining algorithms

Authors:
Ian Davidson; Ashish Grover; Ashwin Satyanarayana; Giri K. Tayi
Affiliations:
SUNY Albany, Albany, NY; GE Research, Niskayuna, NY; SUNY Albany, Albany, NY; SUNY Albany, Albany, NY
Venue:
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2004

Abstract

Data quality is a central issue for many information-oriented organizations. Recent advances in the data quality field reflect the view that a database is the product of a manufacturing process. While routine errors, such as non-existent zip codes, can be detected and corrected using traditional data cleansing tools, many errors systemic to the manufacturing process cannot be addressed in this way. The product of the data manufacturing process is therefore an imprecise recording of information about the entities of interest (e.g., customers, transactions or assets), and the database is only one (flawed) version of the entities it is supposed to represent. Quality assurance systems such as Motorola's Six Sigma and other continuous improvement methods document the shortcomings of the data manufacturing process, and a widespread method of documentation is the quality matrix. In this paper, we explore the use of readily available data quality matrices for the data mining classification task. We first illustrate that if these quality matrices are not factored in, prediction results are sub-optimal. We then suggest a general-purpose ensemble approach that perturbs the data according to these quality matrices to improve predictive accuracy, and we show that the improvement is due to a reduction in variance.
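The abstract does not spell out the perturbation procedure, so the following Python sketch is only an illustration of the general idea, not the authors' algorithm. It assumes that a quality matrix for a categorical attribute can be read as a table giving, for each recorded value, a probability distribution over the values that may actually be true. Each ensemble member is trained on a copy of the data resampled from those distributions, and predictions are combined by majority vote. The names `perturb`, `train_majority_table`, `train_ensemble` and `predict`, the dictionary layout of `quality`, and the toy base learner are all hypothetical choices made for this sketch.

```python
import random
from collections import Counter


def perturb(rows, quality, rng):
    """Return a copy of rows with every recorded value resampled.

    quality[col][recorded] is an assumed layout: a dict mapping each
    candidate true value to its probability, given the recorded value.
    """
    out = []
    for features, label in rows:
        resampled = []
        for col, recorded in enumerate(features):
            dist = quality[col][recorded]
            values = list(dist)
            weights = [dist[v] for v in values]
            resampled.append(rng.choices(values, weights=weights, k=1)[0])
        out.append((tuple(resampled), label))
    return out


def train_majority_table(rows):
    """Toy base learner: majority label per feature tuple, global fallback."""
    per_key = {}
    overall = Counter()
    for features, label in rows:
        per_key.setdefault(features, Counter())[label] += 1
        overall[label] += 1
    default = overall.most_common(1)[0][0]
    table = {key: c.most_common(1)[0][0] for key, c in per_key.items()}
    return lambda features: table.get(tuple(features), default)


def train_ensemble(rows, quality, n_models=25, seed=0):
    """Train one base learner per independently perturbed copy of the data."""
    rng = random.Random(seed)
    return [
        train_majority_table(perturb(rows, quality, rng))
        for _ in range(n_models)
    ]


def predict(models, features):
    """Combine the ensemble members by majority vote."""
    votes = Counter(model(features) for model in models)
    return votes.most_common(1)[0][0]
```

Any classifier could stand in for the toy base learner. Averaging over many perturbed copies is similar in spirit to bagging, which is consistent with the abstract's attribution of the gain to reduced variance, although the sketch itself does not measure variance.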