An Efficient and Robust Approach for Discovering Data Quality Rules

  • Authors:
  • Peter Z. Yeh;Colin A. Puri

  • Venue:
  • ICTAI '10 Proceedings of the 2010 22nd IEEE International Conference on Tools with Artificial Intelligence - Volume 01
  • Year:
  • 2010

Abstract

Poor-quality data is a growing problem that affects many enterprises across all aspects of their business, ranging from operational efficiency to revenue protection. Moreover, this problem is costly to fix because significant effort and resources are required to identify a comprehensive set of rules that can detect (and correct) data defects along various data quality dimensions such as consistency, conformity, and more. Hence, many organizations employ only basic data quality rules that check for null values, format, etc. in efforts such as data profiling and data cleansing, and ignore rules that are needed to detect deeper problems such as inconsistent values across interdependent attributes. This oversight can lead to numerous problems, such as inaccurate reporting of key metrics used to inform critical decisions or derive business insights. In this paper, we present an approach that efficiently and robustly discovers data quality rules, in particular conditional functional dependencies, for detecting inconsistencies in data, and hence improves data quality along the critical dimension of consistency. We evaluate our approach empirically on several real-world data sets and show that it performs well on metrics such as precision and recall. We also compare our approach to an established solution and show that it outperforms this solution on the same metrics. Finally, we show that our approach scales efficiently with the number of records, the number of attributes, and the domain size.
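
To make the notion of a conditional functional dependency concrete, the Python sketch below shows how such a rule can flag inconsistent records. The rule ("for records where country is 'US', zip determines state"), the attribute names, and the sample data are illustrative assumptions rather than material from the paper, and the check is a naive violation scan, not the discovery approach evaluated above.

    # Minimal sketch of checking one conditional functional dependency (CFD).
    # The CFD, attributes, and data below are hypothetical examples.
    from collections import defaultdict

    def cfd_violations(records, condition, lhs, rhs):
        """Return groups of records that satisfy the CFD's condition and share
        the same LHS attribute values but disagree on the RHS attribute."""
        groups = defaultdict(list)
        for rec in records:
            # Only records matching the condition are subject to the rule.
            if all(rec.get(attr) == value for attr, value in condition.items()):
                key = tuple(rec.get(attr) for attr in lhs)
                groups[key].append(rec)
        # A group is inconsistent if its records map the same LHS to multiple RHS values.
        return [recs for recs in groups.values()
                if len({r.get(rhs) for r in recs}) > 1]

    records = [
        {"country": "US", "zip": "10001", "state": "NY"},
        {"country": "US", "zip": "10001", "state": "NJ"},  # conflicts with the row above
        {"country": "UK", "zip": "10001", "state": "NY"},  # ignored: condition not met
    ]
    # CFD: country = 'US'  =>  zip -> state
    for group in cfd_violations(records, {"country": "US"}, ["zip"], "state"):
        print("Inconsistent group:", group)
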