Data quality awareness: a case study for cost optimal association rule mining

Authors:
Laure Berti-Équille
Affiliations:
University of Rennes I, Campus Universitaire de Beaulieu, IRISA, 35042, Rennes, France
Venue:
Knowledge and Information Systems - Special Issue on Mining Low-Quality Data
Year:
2007

Abstract

The quality of discovered association rules is commonly evaluated with interestingness measures (typically support and confidence), which give the user indicators for understanding and using the newly discovered knowledge. Low-quality datasets severely degrade the quality of the discovered association rules, and one might legitimately wonder whether a so-called “interesting” rule, noted LHS → RHS, is meaningful when 30% of the LHS data are no longer up to date, 20% of the RHS data are inaccurate, and 15% of the LHS data come from a data source well known for its poor credibility. This paper presents an overview of data quality characterization and management techniques that can be advantageously employed to improve the quality awareness of knowledge discovery and data mining processes. We propose to integrate data quality indicators for quality-aware association rule mining, together with a cost-based probabilistic model for selecting legitimately interesting rules. Experiments on the challenging KDD-Cup-98 datasets show that variations in data quality have a great impact on the cost and quality of discovered association rules, and they support our approach of integrating the management of data quality indicators into the KDD process to help ensure the quality of data mining results.
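
To make the two interestingness measures named in the abstract concrete, the following is a minimal Python sketch of support and confidence for a rule LHS → RHS, followed by an illustrative quality check. The quality part is an assumption made for illustration only: the per-item scores in `item_quality`, the worst-case aggregation in `rule_quality`, and the `min_quality` threshold are hypothetical and are not the cost-based probabilistic model proposed in the paper.

```python
from typing import Dict, FrozenSet, List

Transaction = FrozenSet[str]


def support(transactions: List[Transaction], itemset: FrozenSet[str]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions: List[Transaction],
               lhs: FrozenSet[str], rhs: FrozenSet[str]) -> float:
    """Estimate of P(RHS | LHS): support(LHS ∪ RHS) / support(LHS)."""
    lhs_support = support(transactions, lhs)
    if lhs_support == 0.0:
        return 0.0
    return support(transactions, lhs | rhs) / lhs_support


def rule_quality(item_quality: Dict[str, float],
                 lhs: FrozenSet[str], rhs: FrozenSet[str]) -> float:
    """Hypothetical aggregation (assumption): a rule is only as good as
    the worst quality score among the items it involves."""
    return min(item_quality[item] for item in lhs | rhs)


def is_legitimately_interesting(transactions: List[Transaction],
                                item_quality: Dict[str, float],
                                lhs: FrozenSet[str], rhs: FrozenSet[str],
                                min_support: float,
                                min_confidence: float,
                                min_quality: float) -> bool:
    """Keep a rule only if it passes the classic thresholds AND the
    illustrative data quality threshold (all thresholds are assumptions)."""
    return (support(transactions, lhs | rhs) >= min_support
            and confidence(transactions, lhs, rhs) >= min_confidence
            and rule_quality(item_quality, lhs, rhs) >= min_quality)


# Toy data, invented for this sketch.
transactions = [
    frozenset({"a", "b"}),
    frozenset({"a", "b", "c"}),
    frozenset({"a", "c"}),
    frozenset({"b", "c"}),
]
# Hypothetical per-item quality scores in [0, 1] (e.g. freshness or accuracy).
item_quality = {"a": 0.7, "b": 0.8, "c": 0.95}

lhs, rhs = frozenset({"a"}), frozenset({"b"})
print(support(transactions, lhs | rhs))
print(confidence(transactions, lhs, rhs))
print(is_legitimately_interesting(transactions, item_quality, lhs, rhs,
                                  min_support=0.4, min_confidence=0.6,
                                  min_quality=0.75))
```

In this toy setting the rule a → b passes the support and confidence thresholds but is rejected once the quality threshold is applied, which mirrors the abstract's concern that an "interesting" rule may rest on poor data.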