Using Data Mining Techniques to Discover Bias Patterns in Missing Data

Authors:
Monica Chiarini Tremblay;Kaushik Dutta;Debra Vandermeer
Affiliations:
Florida International University;Florida International University;Florida International University
Venue:
Journal of Data and Information Quality (JDIQ)
Year:
2010

Abstract

In today’s data-rich environment, decision makers draw conclusions from data repositories that may contain data quality problems. In this context, missing data is an important and well-known problem, since it can seriously affect the accuracy of the conclusions drawn. Researchers have described several approaches for dealing with missing data, primarily attempting to infer the missing values or to estimate the impact of missing data on conclusions. However, few have considered approaches to characterize patterns of bias in missing data, that is, to determine the specific attributes that predict the missingness of data values. Knowledge of the specific systematic bias patterns in the incidence of missing data can help analysts more accurately assess the quality of conclusions drawn from data sets with missing data. This research proposes a methodology that combines a number of Knowledge Discovery and Data Mining techniques, including association rule mining, to discover patterns in related attribute values that help characterize these bias patterns. We demonstrate the efficacy of our proposed approach by applying it to a demo census dataset seeded with biased missing data. The experimental results show that our approach was able to find the seeded biases and filter out most of the seeded noise.
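The idea of attributes that predict missingness can be illustrated with a small sketch. This is not the methodology of the paper, only a minimal single-antecedent rule search under stated assumptions: the function name `missingness_rules`, the thresholds `min_support` and `min_lift`, and the toy records (with hypothetical attributes `region`, `sex`, and `income`) are all invented for illustration. Each candidate rule has the form "attribute = value implies the target value is missing" and is scored with the usual association rule measures of support, confidence, and lift.

```python
from collections import Counter


def missingness_rules(rows, target, min_support=0.05, min_lift=1.5):
    """Return rules (attribute, value) -> `target` is missing, ranked by lift.

    Illustrative sketch only: it considers single-attribute antecedents,
    whereas a full association rule miner would also combine attributes.
    """
    n = len(rows)
    if n == 0:
        return []
    # Overall share of rows in which the target is missing.
    base_rate = sum(r[target] is None for r in rows) / n
    if base_rate == 0:
        return []

    seen = Counter()     # rows containing a given (attribute, value) pair
    missing = Counter()  # the same, restricted to rows with a missing target
    for r in rows:
        for attr, value in r.items():
            if attr == target or value is None:
                continue
            seen[(attr, value)] += 1
            if r[target] is None:
                missing[(attr, value)] += 1

    rules = []
    for item, count in seen.items():
        support = missing[item] / n         # P(item and target missing)
        confidence = missing[item] / count  # P(target missing | item)
        lift = confidence / base_rate       # how far above the base rate
        if support >= min_support and lift >= min_lift:
            rules.append((item, support, confidence, lift))
    return sorted(rules, key=lambda rule: rule[3], reverse=True)


# Hypothetical toy data with a seeded bias: income is missing for
# 40 of the 50 "north" rows but for only 5 of the 50 "south" rows.
rows = []
for i in range(100):
    region = "north" if i < 50 else "south"
    sex = "F" if i % 2 == 0 else "M"
    if region == "north":
        income = "reported" if i % 5 == 0 else None
    else:
        income = None if i % 10 == 0 else "reported"
    rows.append({"region": region, "sex": sex, "income": income})

rules = missingness_rules(rows, target="income")
for (attr, value), support, confidence, lift in rules:
    print(f"{attr}={value} -> income missing "
          f"(support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f})")
```

On this toy data only the seeded pattern, `region=north`, clears both thresholds; the `sex` values stay close to the overall missingness rate and are filtered out. The lift threshold is what separates a systematic bias from values that merely co-occur with missing data at the base rate, and the cutoffs chosen here are arbitrary.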