Data quality problems are a persistent concern, especially for large, historically grown databases. When such databases are maintained over long periods, the interpretation and usage of their schemas often shift. As a result, traditional data scrubbing techniques that rely on existing schema and integrity constraint documentation are hardly applicable. So-called data auditing environments circumvent this problem by using machine learning techniques to induce semantically meaningful structures from the actual data and then classifying outliers that do not fit the induced schema as potential errors. However, because the quality of the analyzed database is unknown a priori, designing a data auditing environment requires special methods for calibrating error measurements based on the induced schema. In this paper, we present a data audit test generator that systematically generates and pollutes artificial benchmark databases for this purpose. The test generator has been implemented as part of a data auditing environment based on the well-known machine learning algorithm C4.5. Validation in the partial quality audit of a large service-related database at Daimler-Chrysler shows the usefulness of the approach as a complement to standard data scrubbing.
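The generate, pollute, and audit cycle described above can be illustrated with a small sketch. It is not the paper's test generator: the record fields (`model`, `service`), the hidden rule, the pollution rate, and all function names are hypothetical, and a simple majority-value rule per attribute value stands in for C4.5 as the structure-inducing learner. Because the polluted positions are known, the flagged outliers can be scored against them, which is the kind of calibration that is impossible on a real database of unknown quality.

```python
import random
from collections import Counter, defaultdict

# Hypothetical hidden rule: the 'service' value is fully determined by 'model'.
RULE = {"A": "oil", "B": "brake", "C": "tire"}


def generate_clean(n, rng):
    """Generate n artificial records that all obey the hidden rule."""
    return [{"model": m, "service": RULE[m]}
            for m in (rng.choice("ABC") for _ in range(n))]


def pollute(records, rate, rng):
    """Copy the records and overwrite 'service' with a wrong value in
    roughly `rate` of them. Returns the polluted copy and the set of
    corrupted indices (the ground truth for later scoring)."""
    values = sorted(set(RULE.values()))
    polluted = [dict(r) for r in records]
    corrupted = set()
    for i, r in enumerate(polluted):
        if rng.random() < rate:
            r["service"] = rng.choice([v for v in values if v != r["service"]])
            corrupted.add(i)
    return polluted, corrupted


def audit(records):
    """Induce the majority 'service' per 'model' from the (polluted) data
    and flag every record that deviates from it. This majority rule is a
    stand-in for a decision-tree learner such as C4.5."""
    counts = defaultdict(Counter)
    for r in records:
        counts[r["model"]][r["service"]] += 1
    majority = {m: c.most_common(1)[0][0] for m, c in counts.items()}
    return {i for i, r in enumerate(records)
            if r["service"] != majority[r["model"]]}


def score(flagged, corrupted):
    """Precision and recall of the flagged set against the known errors."""
    tp = len(flagged & corrupted)
    precision = tp / len(flagged) if flagged else 1.0
    recall = tp / len(corrupted) if corrupted else 1.0
    return precision, recall


rng = random.Random(0)  # fixed seed so the benchmark is reproducible
clean = generate_clean(1000, rng)
polluted, corrupted = pollute(clean, 0.05, rng)
flagged = audit(polluted)
precision, recall = score(flagged, corrupted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy setting the hidden rule is deterministic and the pollution rate is low, so the induced structure recovers the rule and the audit separates polluted from clean records cleanly. Raising the pollution rate or making the rule probabilistic would show how the error measurements degrade.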