AN EMPIRICAL COMPARISON OF TECHNIQUES FOR HANDLING INCOMPLETE DATA USING DECISION TREES

Authors:
Bhekisipho Twala
Affiliations:
Modelling and Digital Intelligence, CSIR, Pretoria, South Africa
Venue:
Applied Artificial Intelligence
Year:
2009

Abstract

Increasing awareness of how incomplete data affect learning and classification accuracy has led to a growing number of missing data techniques. This article investigates the robustness and accuracy of seven popular techniques for tolerating incomplete training and test data, examining how different proportions, patterns, and mechanisms of missing data affect the resulting tree-based models. The seven missing data techniques were compared by artificially simulating different proportions, patterns, and mechanisms of missing data using 21 complete datasets (i.e., with no missing values) obtained from the University of California, Irvine repository of machine-learning databases (Blake and Merz, 1998). A four-way repeated measures design was employed to analyze the data. The simulation results suggest important differences. All methods have their strengths and weaknesses. However, listwise deletion is substantially inferior to the other six techniques, while multiple imputation, which utilizes the expectation maximization algorithm, represents a superior approach to handling incomplete data. Decision tree single imputation and surrogate variables splitting are more severely impacted by missing values distributed among all attributes than by missing values confined to a single attribute. Otherwise, the imputation and model-based procedures gave reasonably good results, although some discrepancies remained. Different techniques for addressing missing values when using decision trees can give substantially different results, and must be carefully considered to protect against biases and spurious findings. Multiple imputation should always be used, especially if the data contain many missing values. If few values are missing, any of the missing data techniques might be considered. The choice of technique should be guided by the proportion, pattern, and mechanisms of missing data, especially the latter two. However, the use of older techniques like listwise deletion and mean or mode single imputation is no longer justifiable given the accessibility and ease of use of more advanced techniques, such as multiple imputation and supervised learning imputation.
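
The simulation idea described in the abstract (starting from a complete dataset, removing values artificially, then applying a missing data technique) can be illustrated with a minimal sketch. This is not the author's code. It assumes a small toy dataset instead of the 21 UCI datasets, uses a missing-completely-at-random mechanism as one example mechanism (the abstract does not name the mechanisms studied), and covers only the two older techniques the abstract mentions, listwise deletion and mean single imputation. It trains no decision tree. The function names `inject_mcar`, `listwise_deletion`, and `mean_imputation` are hypothetical.

```python
import random


def inject_mcar(rows, proportion, columns=None, seed=0):
    """Return a copy of `rows` with a fixed share of cells set to None.

    Cells are chosen uniformly at random, independently of any data value
    (an assumed MCAR mechanism). `columns=None` spreads the missing values
    over all attributes; a list such as [0] confines them to those attributes.
    """
    rng = random.Random(seed)
    cols = list(range(len(rows[0]))) if columns is None else list(columns)
    cells = [(i, j) for i in range(len(rows)) for j in cols]
    n_missing = round(proportion * len(cells))
    out = [list(row) for row in rows]  # copy so the complete data stay intact
    for i, j in rng.sample(cells, n_missing):
        out[i][j] = None
    return out


def listwise_deletion(rows):
    """Keep only the rows that have no missing value."""
    return [row for row in rows if all(v is not None for v in row)]


def mean_imputation(rows):
    """Replace each missing value with the mean of its attribute's observed values.

    Assumes numeric attributes with at least one observed value each.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


# Toy complete dataset: 20 instances, 3 numeric attributes (illustrative only).
complete = [[float(i), float(i % 5), float(2 * i)] for i in range(20)]

# Same proportion of missing cells, two patterns: one attribute vs. all attributes.
one_attr = inject_mcar(complete, 0.3, columns=[0], seed=1)
all_attrs = inject_mcar(complete, 0.3, seed=1)

for name, data in [("single attribute", one_attr), ("all attributes", all_attrs)]:
    kept = listwise_deletion(data)
    filled = mean_imputation(data)
    print(name, "- rows after listwise deletion:", len(kept), "of", len(data))
    print(name, "- rows after mean imputation:", len(filled), "of", len(data))
```

The sketch only shows the mechanical difference between the two techniques: listwise deletion discards every instance with a missing value, while mean imputation keeps every instance and fills the gaps with a per-attribute constant. Comparing classification accuracy, as the study does, would additionally require training a tree-based model on each treated dataset.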