An Investigation of Missing Data Methods for Classification Trees Applied to Binary Response Data

Authors: Yufeng Ding; Jeffrey S. Simonoff
Venue: The Journal of Machine Learning Research
Year: 2010

Abstract

Classification tree algorithms use many different methods to handle missing data in the predictors, but few studies have compared their appropriateness and performance. This paper provides both analytic and Monte Carlo evidence on the effectiveness of six popular missing data methods for classification trees applied to binary response data. We show that, in the context of classification trees, two criteria are most helpful for distinguishing among missing data methods: the relationship between the missingness and the dependent variable, and whether the testing data contain missing values. In particular, the separate class method is clearly the best choice when the testing set has missing values and the missingness is related to the response variable. A real data set on modeling firm bankruptcy is then analyzed. The paper concludes with a discussion of how these results adapt to logistic regression, along with other potential generalizations.
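
The separate class idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes a single categorical predictor and a one-level tree; the toy data, the function names, and the `"missing"` label are all hypothetical and are not taken from the paper. Missing entries are recoded as an extra category, so the tree can give them their own leaf when missingness is related to the response.

```python
from collections import defaultdict

MISSING = "missing"  # hypothetical label for the extra category


def separate_class(values):
    """Recode missing entries (None) as their own category."""
    return [MISSING if v is None else v for v in values]


def fit_stump(x, y):
    """Fit a one-level tree on a categorical predictor.

    Each observed category becomes a leaf that predicts the majority
    binary response (ties go to class 0).
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [n(y=0), n(y=1)]
    for xi, yi in zip(x, y):
        counts[xi][yi] += 1
    return {cat: int(c[1] > c[0]) for cat, c in counts.items()}


def predict(stump, x, default=0):
    """Look up each category's leaf; unseen categories get `default`."""
    return [stump.get(xi, default) for xi in x]


# Toy data, invented for illustration: every missing predictor value
# occurs with response 1, so missingness is related to the response.
x_train = ["a", "a", "b", "b", None, None, None, "a"]
y_train = [0, 0, 1, 1, 1, 1, 1, 0]

stump = fit_stump(separate_class(x_train), y_train)

# The testing data also contain missing values, which are recoded
# the same way before prediction.
x_test = ["a", None, "b", None]
preds = predict(stump, separate_class(x_test))
print(stump)
print(preds)
```

In this toy setting the extra category carries predictive signal of its own, which is the situation the abstract singles out: missing values in the testing set and missingness related to the response.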