A general approach to incorporate data quality matrices into data mining algorithms

  • Authors:
  • Ian Davidson;Ashish Grover;Ashwin Satyanarayana;Giri K. Tayi

  • Affiliations:
  • SUNY Albany, Albany, NY;GE Research, Niskayuna, NY;SUNY Albany, Albany, NY;SUNY Albany, Albany, NY

  • Venue:
  • Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
  • Year:
  • 2004

Quantified Score

Hi-index 0.00

Visualization

Abstract

Data quality is a central issue for many information-oriented organizations. Recent advances in the data quality field reflect the view that a database is the product of a manufacturing process. While routine errors, such as non-existent zip codes, can be detected and corrected using traditional data cleansing tools, many errors systemic to the manufacturing process cannot be addressed. Therefore, the product of the data manufacturing process is an imprecise recording of information about the entities of interest (i.e. customers, transactions or assets). In this way, the database is only one (flawed) version of the entities it is supposed to represent. Quality assurance systems such as Motorola's Six-Sigma and other continuous improvement methods document the data manufacturing process's shortcomings. A widespread method of documentation is quality matrices. In this paper, we explore the use of the readily available data quality matrices for the data mining classification task. We first illustrate that if we do not factor in these quality matrices, then our results for prediction are sub-optimal. We then suggest a general-purpose ensemble approach that perturbs the data according to these quality matrices to improve the predictive accuracy and show the improvement is due to a reduction in variance.