Systematic development of data mining-based data quality tools

  • Authors:
  • Dominik Luebbers, Udo Grimmer, Matthias Jarke

  • Affiliations:
  • Dominik Luebbers: RWTH Aachen, Informatik V (Information Systems), Aachen, Germany
  • Udo Grimmer: DaimlerChrysler AG, Research & Technology, Ulm, Germany
  • Matthias Jarke: RWTH Aachen, Informatik V (Information Systems), Aachen, Germany, and Fraunhofer FIT, Schloss Birlinghoven, Sankt Augustin, Germany

  • Venue:
  • VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
  • Year:
  • 2003

Abstract

Data quality problems have been a persistent concern, especially in large, historically grown databases. When such databases are maintained over long periods, the interpretation and usage of their schemas often shift. Traditional data scrubbing techniques, which rely on existing schema and integrity constraint documentation, are therefore hardly applicable. So-called data auditing environments circumvent this problem by using machine learning techniques to induce semantically meaningful structures from the actual data, and then classifying outliers that do not fit the induced schema as potential errors. However, as the quality of the analyzed database is a priori unknown, the design of data auditing environments requires special methods for calibrating error measurements based on the induced schema. In this paper, we present a data audit test generator that systematically generates and pollutes artificial benchmark databases for this purpose. The test generator has been implemented as part of a data auditing environment based on the well-known machine learning algorithm C4.5. Validation in the partial quality audit of a large service-related database at DaimlerChrysler shows the usefulness of the approach as a complement to standard data scrubbing.
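The calibration idea in the abstract can be illustrated with a minimal sketch: generate a clean synthetic table, pollute a known fraction of it, induce a rule from the polluted data, flag records that contradict the rule as potential errors, and compare the flags against the known pollution. Everything below is hypothetical illustration, not the paper's implementation; the one-threshold rule induction is a toy stand-in for C4.5.

```python
import random

random.seed(0)

# 1. Generate a clean synthetic table where `label` is functionally
#    determined by `value` (label = "high" iff value >= 50).
rows = [{"value": random.uniform(0, 100)} for _ in range(400)]
for r in rows:
    r["label"] = "high" if r["value"] >= 50 else "low"

# 2. Pollute: flip the label of a known 10% sample of rows and remember
#    which ones were corrupted (the gold standard for calibration).
polluted = set(random.sample(range(len(rows)), 40))
for i in polluted:
    rows[i]["label"] = "low" if rows[i]["label"] == "high" else "high"

# 3. Induce a one-split rule from the *polluted* data (a toy stand-in
#    for C4.5): pick the threshold that minimises disagreement with
#    the stored labels.
def induce_rule(rows):
    best = None
    for t in sorted(r["value"] for r in rows):
        left = [r["label"] for r in rows if r["value"] < t]
        right = [r["label"] for r in rows if r["value"] >= t]
        maj_l = max(set(left) or {"low"}, key=left.count)
        maj_r = max(set(right) or {"low"}, key=right.count)
        errs = sum(l != maj_l for l in left) + sum(l != maj_r for l in right)
        if best is None or errs < best[0]:
            best = (errs, t, maj_l, maj_r)
    return best[1], best[2], best[3]

t, maj_l, maj_r = induce_rule(rows)

# 4. Audit: records contradicting the induced rule are potential errors.
flagged = {i for i, r in enumerate(rows)
           if r["label"] != (maj_l if r["value"] < t else maj_r)}

# 5. Calibrate the error measurement against the known pollution.
precision = len(flagged & polluted) / len(flagged)
recall = len(flagged & polluted) / len(polluted)
print(f"threshold={t:.1f} precision={precision:.2f} recall={recall:.2f}")
```

Because the pollution set is known by construction, precision and recall of the flagged records can be measured exactly; this is the benefit of a systematically generated benchmark over auditing a database of unknown quality.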