An iterative refinement approach for data cleaning

Authors:
Amitava Karmaker;Stephen Kwek
Affiliations:
Department of Computer Science, University of Texas at San Antonio, TX 78249, USA. E-mail: {akarmake,kwek}@cs.utsa.edu
Venue:
Intelligent Data Analysis
Year:
2007

Citing 11

Statistical analysis with missing data.
Unknown attribute values in induction. Proceedings of the sixth international workshop on Machine learning.
Data mining: concepts and techniques.
Imputation of Missing Data in Industrial Databases. Applied Intelligence.
Maximum Consistency of Incomplete Data via Non-Invasive Imputation. Artificial Intelligence Review.
Induction of Decision Trees. Machine Learning.
Preprocessing of Missing Values Using Robust Association Rules. PKDD '98 Proceedings of the Second European Symposium on Principles of Data Mining and Knowledge Discovery.
Approximate Association Rule Mining. Proceedings of the Fourteenth International Florida Artificial Intelligence Research Society Conference.
A Grey-Based Nearest Neighbor Approach for Missing Attribute Value Prediction. Applied Intelligence.
Using Association Rules for Completing Missing Data. HIS '04 Proceedings of the Fourth International Conference on Hybrid Intelligent Systems.
Missing values prediction with K2. Intelligent Data Analysis.

Abstract

Data cleaning is an important step in the data mining process, because successful data mining applications require good-quality data. In this paper, we propose a data cleaning technique that smooths out a substantial amount of attribute noise and also handles missing attribute values. Our approach is inspired by the Expectation-Maximization (EM) algorithm: it iteratively refines each attribute value using a predictor constructed from the previously refined values (from the known values in the first iteration). We demonstrate the effectiveness of the technique in smoothing out attribute noise and corroborate it by showing improved classification accuracy on a number of real-world data sets from the UCI repository [2]. Moreover, we show that the technique can easily be adapted to fill in missing attribute values in classification problems more effectively than other standard approaches.
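
The iterative refinement idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which predictor is used, so a k-nearest-neighbour average stands in for it here, and the function name iterative_refine, the parameters k and rounds, and the column-mean seeding are all assumptions made for illustration. The sketch refines only the missing cells (marked None) of a numeric table; the technique in the paper also refines noisy known values.

```python
from math import dist
from statistics import mean


def iterative_refine(rows, k=3, rounds=5):
    """Fill None cells by repeated k-NN prediction from the current estimates.

    Hypothetical helper: the k-NN predictor and the column-mean seed are
    stand-ins chosen for this sketch, not details taken from the paper.
    Assumes every column has at least one known value.
    """
    n_cols = len(rows[0])
    missing = [(i, j) for i, r in enumerate(rows)
               for j, v in enumerate(r) if v is None]

    # Seed each missing cell with the mean of the known values in its column.
    col_means = [mean(r[j] for r in rows if r[j] is not None)
                 for j in range(n_cols)]
    current = [[col_means[j] if v is None else float(v)
                for j, v in enumerate(r)] for r in rows]

    for _ in range(rounds):
        updated = [r[:] for r in current]
        for i, j in missing:
            others = [c for c in range(n_cols) if c != j]
            target = [current[i][c] for c in others]
            # Rank the other rows by distance on the remaining attributes,
            # using the estimates produced by the previous round.
            neighbours = sorted(
                (r for idx, r in enumerate(current) if idx != i),
                key=lambda r: dist(target, [r[c] for c in others]),
            )[:k]
            updated[i][j] = mean(r[j] for r in neighbours)
        current = updated
    return current


rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0], [2.0, None]]
filled = iterative_refine(rows)
```

Each round predicts a cell from the previous round's estimates, which mirrors the "previously refined values" step in the abstract. Swapping the k-NN average for a learned classifier or regressor would change only the prediction line.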