Classification tree algorithms use many different methods to handle missing data in the predictors, but few studies have compared their appropriateness and performance. This paper provides both analytic and Monte Carlo evidence on the effectiveness of six popular missing data methods for classification trees applied to binary response data. We show that, in the context of classification trees, the two most helpful criteria for distinguishing among missing data methods are the relationship between the missingness and the dependent variable, and whether the testing data contain missing values. In particular, the separate class method is clearly the best choice when the testing set has missing values and the missingness is related to the response variable. A real data set on modeling firm bankruptcy is then analyzed. The paper concludes with a discussion of how these results adapt to logistic regression, and of other potential generalizations.
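As a rough illustration of the separate class idea mentioned above, the following Python sketch recodes missing entries of a categorical predictor as an extra category and then tabulates the positive-response rate per category. It is a minimal sketch, not the paper's implementation: the `MISSING` label, the helper names `separate_class` and `class_rates`, and the toy data are all assumptions made up for this example.

```python
from collections import Counter

MISSING = "__missing__"  # hypothetical label for the extra category


def separate_class(values):
    """Recode None entries of a categorical predictor as their own category."""
    return [MISSING if v is None else v for v in values]


def class_rates(predictor, response):
    """Proportion of positive responses (1) within each predictor category."""
    totals, positives = Counter(), Counter()
    for x, y in zip(predictor, response):
        totals[x] += 1
        positives[x] += y
    return {x: positives[x] / totals[x] for x in totals}


# Toy data, invented for illustration: missingness is related to the response.
x = ["a", None, "b", None, "a", "b", None, "a"]
y = [0, 1, 0, 1, 0, 1, 1, 0]

rates = class_rates(separate_class(x), y)
for category, rate in sorted(rates.items()):
    print(category, rate)
```

In this toy data every missing entry has a positive response, so a tree that is allowed to split on the extra category can use the missingness itself as a predictor. This is the kind of situation in which the abstract reports the separate class method doing well.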