Data quality is a central issue for many information-oriented organizations. Recent advances in the data quality field reflect the view that a database is the product of a manufacturing process. While routine errors, such as non-existent zip codes, can be detected and corrected using traditional data cleansing tools, many errors systemic to the manufacturing process cannot be addressed by such tools. The product of the data manufacturing process is therefore an imprecise recording of information about the entities of interest (e.g., customers, transactions, or assets). In this way, the database is only one (flawed) version of the entities it is supposed to represent. Quality assurance systems, such as Motorola's Six Sigma and other continuous improvement methods, document the data manufacturing process's shortcomings. A widespread form of documentation is the quality matrix. In this paper, we explore the use of readily available data quality matrices for the data mining classification task. We first show that ignoring these quality matrices leads to sub-optimal predictive results. We then propose a general-purpose ensemble approach that perturbs the data according to these quality matrices to improve predictive accuracy, and show that the improvement is due to a reduction in variance.
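The ensemble idea described above can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: it assumes a single categorical feature whose quality matrix `Q` is known, where `Q[i, j]` is taken to be the probability that a recorded value `i` is truly `j`. Each ensemble member trains on an independently perturbed copy of the data, and predictions are aggregated by majority vote, the variance-reducing step.

```python
# Hypothetical sketch of a quality-matrix-driven ensemble classifier.
# Assumptions (not from the paper): one categorical column, a row-stochastic
# quality matrix Q, and decision trees as the base learner.
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def perturb(X, col, Q, rng):
    """Resample one categorical column of X according to quality matrix Q."""
    Xp = X.copy()
    for i in range(X.shape[0]):
        recorded = int(X[i, col])
        # Draw the "true" value from the error distribution for this recording.
        Xp[i, col] = rng.choice(len(Q), p=Q[recorded])
    return Xp


def quality_ensemble(X, y, col, Q, n_members=25, seed=0):
    """Train n_members trees, each on an independently Q-perturbed copy."""
    rng = np.random.default_rng(seed)
    return [
        DecisionTreeClassifier(random_state=k).fit(perturb(X, col, Q, rng), y)
        for k in range(n_members)
    ]


def predict(models, X):
    """Majority vote across ensemble members."""
    votes = np.stack([m.predict(X) for m in models])  # (n_members, n_samples)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```

A usage sketch: fit `quality_ensemble(X, y, col=0, Q=Q)` on training data and call `predict` on new rows; averaging over many perturbed versions of the flawed recording is what drives the variance reduction the abstract refers to.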