Increasing awareness of how incomplete data affect learning and classification accuracy has led to a growing number of missing data techniques. This article investigates the robustness and accuracy of seven popular techniques for tolerating incomplete training and test data, examining how different proportions, patterns, and mechanisms of missing data affect the resulting tree-based models. The seven missing data techniques were compared by artificially simulating different proportions, patterns, and mechanisms of missing data in 21 complete datasets (i.e., with no missing values) obtained from the University of California, Irvine repository of machine-learning databases (Blake and Merz, 1998). A four-way repeated measures design was employed to analyze the data. The simulation results suggest important differences. All methods have their strengths and weaknesses. However, listwise deletion is substantially inferior to the other six techniques, while multiple imputation, which utilizes the expectation-maximization algorithm, represents a superior approach to handling incomplete data. Decision tree single imputation and surrogate variable splitting are more severely affected when missing values are distributed among all attributes than when they are confined to a single attribute. Otherwise, both the imputation-based and model-based procedures gave reasonably good results, although some discrepancies remained. Different techniques for handling missing values when using decision trees can give substantially different results and must be chosen carefully to protect against biases and spurious findings. Multiple imputation should always be used, especially if the data contain many missing values. If few values are missing, any of the missing data techniques might be considered. The choice of technique should be guided by the proportion, pattern, and mechanism of missing data, especially the latter two.
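As a minimal sketch of two of the techniques compared above, the following NumPy-only example contrasts listwise deletion with mean single imputation on a toy dataset of our own (the dataset, missingness rate, and MCAR injection are illustrative assumptions, not the paper's experimental setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy complete dataset (illustrative): 8 observations, 2 attributes.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0],
              [4.0, 8.0],
              [5.0, 10.0],
              [6.0, 12.0],
              [7.0, 14.0],
              [8.0, 16.0]])

# Inject values missing completely at random (MCAR) at a 25% rate.
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.25] = np.nan

# Listwise deletion: drop every row that contains any missing value.
complete_rows = ~np.isnan(X_miss).any(axis=1)
X_listwise = X_miss[complete_rows]

# Mean single imputation: replace each NaN with the column mean
# computed from the observed values of that attribute.
col_means = np.nanmean(X_miss, axis=0)
X_imputed = np.where(np.isnan(X_miss), col_means, X_miss)

print("rows kept by listwise deletion:", X_listwise.shape[0])
print("imputed dataset shape:", X_imputed.shape)
```

Note how listwise deletion shrinks the training sample (each affected row is lost entirely), whereas imputation preserves every observation at the cost of substituting estimated values, which is one reason the simulations find deletion substantially inferior as the proportion of missing values grows.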
However, the use of older techniques like listwise deletion and mean or mode single imputation is no longer justifiable given the accessibility and ease of use of more advanced techniques, such as multiple imputation and supervised learning imputation.
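To make the idea of supervised learning imputation concrete, here is a hedged, NumPy-only sketch: a predictor for the attribute with missing values is fit on the complete cases and then used to fill the gaps (the linear model, toy data, and 20% missingness rate are assumptions for illustration, not a method from the article):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data: attribute x2 depends roughly linearly on x1.
n = 50
x1 = rng.normal(size=n)
x2 = 3.0 * x1 + rng.normal(scale=0.1, size=n)
X = np.column_stack([x1, x2])

# Make ~20% of the x2 values missing completely at random.
missing = rng.random(n) < 0.2
X_miss = X.copy()
X_miss[missing, 1] = np.nan

# Supervised learning imputation: fit a least-squares predictor for
# x2 from x1 on the complete cases only.
obs = ~missing
A = np.column_stack([X_miss[obs, 0], np.ones(obs.sum())])
coef, *_ = np.linalg.lstsq(A, X_miss[obs, 1], rcond=None)

# Predict and fill in the missing entries.
X_imp = X_miss.copy()
X_imp[missing, 1] = coef[0] * X_miss[missing, 0] + coef[1]
```

Unlike mean imputation, this exploits the relationship between attributes, so the filled-in values track the true ones; multiple imputation goes a step further by drawing several plausible fills to reflect the uncertainty of the prediction.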