A decision tree-based missing value imputation technique for data pre-processing

Authors:
Geaur Rahman;Zahidul Islam
Affiliations:
Charles Sturt University, Wagga Wagga, Australia;Charles Sturt University, Wagga Wagga, Australia
Venue:
AusDM '11 Proceedings of the Ninth Australasian Data Mining Conference - Volume 121
Year:
2011

Abstract

Data pre-processing plays a vital role in data mining by ensuring good data quality. In general, data pre-processing tasks include imputing missing values, identifying outliers, smoothing noisy data and correcting inconsistent data. In this paper, we present an efficient missing value imputation technique called DMI, which makes use of a decision tree and the expectation maximization (EM) algorithm. We argue that the correlations among attributes within a horizontal partition of a data set can be higher than the correlations over the whole data set. For some existing algorithms, such as EM-based imputation (EMI), imputation accuracy is expected to be better for a data set with higher correlations than for one with lower correlations. Therefore, our technique (DMI) applies EMI to various horizontal segments of a data set in which the correlations among attributes are high. We evaluate DMI on two publicly available natural data sets by comparing its performance with that of EMI. We use various patterns of missing values, each with different missing ratios of up to 10%. Several evaluation criteria are used, such as the coefficient of determination (R²), the index of agreement (d₂) and the root mean squared error (RMSE). Our initial experimental results indicate that DMI performs significantly better than EMI.
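The central argument of the abstract, that attribute correlations inside a horizontal segment can be much higher than over the whole data set, can be illustrated with a small sketch. This is not the DMI algorithm itself. It rests on three assumptions made purely for illustration: the toy records are synthetic, a known segment label ("a" or "b") stands in for the leaf of a decision tree, and a simple least-squares line stands in for EMI. All function and variable names are hypothetical.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def fit_line(xs, ys):
    """Least-squares slope and intercept (a stand-in for EMI, not EMI itself)."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def impute(complete, incomplete):
    """Predict the missing y of each incomplete record from the complete ones."""
    slope, intercept = fit_line([x for _, x, _ in complete],
                                [y for _, _, y in complete])
    return [intercept + slope * x for _, x, _ in incomplete]

def rmse(predicted, actual):
    return sqrt(mean([(p - a) ** 2 for p, a in zip(predicted, actual)]))

# Synthetic records (segment, x, y). The segment label plays the role of a
# decision tree leaf; y rises with x in segment "a" and falls in segment "b".
noise = [0.1, -0.1, 0.2, -0.2, 0.1, -0.1]
records = ([("a", x, 2 * x + e) for x, e in zip(range(1, 7), noise)]
           + [("b", x, 14 - 2 * x + e) for x, e in zip(range(1, 7), noise)])

# Correlation over the whole data set versus within each segment.
r_whole = pearson([x for _, x, _ in records], [y for _, _, y in records])
r_a = pearson([x for s, x, _ in records if s == "a"],
              [y for s, _, y in records if s == "a"])
r_b = pearson([x for s, x, _ in records if s == "b"],
              [y for s, _, y in records if s == "b"])

# Treat y as missing in two records and keep the true values for scoring.
missing = {("a", 2), ("b", 5)}
incomplete = [r for r in records if (r[0], r[1]) in missing]
complete = [r for r in records if (r[0], r[1]) not in missing]
truth = [y for _, _, y in incomplete]

# Imputation using all complete records at once.
rmse_whole = rmse(impute(complete, incomplete), truth)

# Imputation performed separately inside each segment.
segment_pred, segment_truth = [], []
for label in ("a", "b"):
    seg_complete = [r for r in complete if r[0] == label]
    seg_incomplete = [r for r in incomplete if r[0] == label]
    segment_pred += impute(seg_complete, seg_incomplete)
    segment_truth += [y for _, _, y in seg_incomplete]
rmse_segment = rmse(segment_pred, segment_truth)

print(f"correlation: whole={r_whole:.3f}, a={r_a:.3f}, b={r_b:.3f}")
print(f"RMSE: whole={rmse_whole:.3f}, segment-wise={rmse_segment:.3f}")
```

In this toy data the two segments have opposite trends, so the correlation over the whole data set nearly cancels out while each segment on its own is almost perfectly linear. The data was constructed to make that effect visible, so the numbers say nothing about how DMI and EMI compare on real data sets.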