Systematic development of data mining-based data quality tools

  • Authors:
  • Dominik Luebbers, Udo Grimmer, Matthias Jarke

  • Affiliations:
  • Dominik Luebbers: RWTH Aachen, Informatik V (Information Systems), Aachen, Germany
  • Udo Grimmer: DaimlerChrysler AG, Research & Technology, Ulm, Germany
  • Matthias Jarke: RWTH Aachen, Informatik V (Information Systems), Aachen, Germany, and Fraunhofer FIT, Schloss Birlinghoven, Sankt Augustin, Germany

  • Venue:
  • VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
  • Year:
  • 2003

Abstract

Data quality problems have been a persistent concern, especially in large, historically grown databases. When such databases are maintained over long periods, the interpretation and usage of their schemas often shift. Traditional data scrubbing techniques, which rely on existing schema and integrity constraint documentation, are therefore hardly applicable. So-called data auditing environments circumvent this problem by using machine learning techniques to induce semantically meaningful structures from the actual data, and then classifying outliers that do not fit the induced schema as potential errors. However, as the quality of the analyzed database is a priori unknown, the design of data auditing environments requires special methods for calibrating error measurements based on the induced schema. In this paper, we present a data audit test generator that systematically generates and pollutes artificial benchmark databases for this purpose. The test generator has been implemented as part of a data auditing environment based on the well-known machine learning algorithm C4.5. Validation in the partial quality audit of a large service-related database at DaimlerChrysler shows the usefulness of the approach as a complement to standard data scrubbing.
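The calibration idea in the abstract can be illustrated with a minimal sketch: generate a clean synthetic table, pollute a known fraction of it, induce a rule from the polluted data, flag records that contradict the rule as potential errors, and compare the flags against the known pollution. Everything below is hypothetical illustration, not the paper's implementation; the one-threshold rule induction is a toy stand-in for C4.5.

```python
import random

random.seed(0)

# 1. Generate a clean synthetic table where `label` is functionally
#    determined by `value` (label = "high" iff value >= 50).
rows = [{"value": random.uniform(0, 100)} for _ in range(400)]
for r in rows:
    r["label"] = "high" if r["value"] >= 50 else "low"

# 2. Pollute: flip the label of a known 10% sample of rows and remember
#    which ones were corrupted (the gold standard for calibration).
polluted = set(random.sample(range(len(rows)), 40))
for i in polluted:
    rows[i]["label"] = "low" if rows[i]["label"] == "high" else "high"

# 3. Induce a one-split rule from the *polluted* data (a toy stand-in
#    for C4.5): pick the threshold that minimises disagreement with
#    the stored labels.
def induce_rule(rows):
    best = None
    for t in sorted(r["value"] for r in rows):
        left = [r["label"] for r in rows if r["value"] < t]
        right = [r["label"] for r in rows if r["value"] >= t]
        maj_l = max(set(left) or {"low"}, key=left.count)
        maj_r = max(set(right) or {"low"}, key=right.count)
        errs = sum(l != maj_l for l in left) + sum(l != maj_r for l in right)
        if best is None or errs < best[0]:
            best = (errs, t, maj_l, maj_r)
    return best[1], best[2], best[3]

t, maj_l, maj_r = induce_rule(rows)

# 4. Audit: records contradicting the induced rule are potential errors.
flagged = {i for i, r in enumerate(rows)
           if r["label"] != (maj_l if r["value"] < t else maj_r)}

# 5. Calibrate the error measurement against the known pollution.
precision = len(flagged & polluted) / len(flagged)
recall = len(flagged & polluted) / len(polluted)
print(f"threshold={t:.1f} precision={precision:.2f} recall={recall:.2f}")
```

Because the pollution set is known by construction, precision and recall of the flagged records can be measured exactly; this is the benefit of a systematically generated benchmark over auditing a database of unknown quality.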