Handling Missing Values when Applying Classification Models

Authors: Maytal Saar-Tsechansky; Foster Provost
Venue: The Journal of Machine Learning Research
Year: 2007

Abstract

Much work has studied the effect of different treatments of missing values on model induction, but little work has analyzed treatments for the common case of missing values at prediction time. This paper first compares three methods for applying classification trees to instances with missing values: predictive value imputation, the distribution-based imputation used by C4.5, and reduced models. It also shows evidence that the results generalize to bagged trees and to logistic regression. The results show that each of the two most popular treatments, the imputation methods, is preferable under different conditions. Strikingly, the reduced-models approach, seldom mentioned or used, consistently outperforms the other two methods, sometimes by a large margin. The lack of attention to reduced modeling may be due in part to its (perceived) expense in terms of computation or storage. Therefore, we then introduce and evaluate alternative, hybrid approaches that allow users to trade off the more accurate but computationally expensive reduced modeling against the other, less accurate but cheaper treatments. The results show that the hybrid methods scale gracefully with the amount of investment in computation or storage, and that they outperform imputation even for small investments.
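The reduced-models idea can be illustrated with a minimal Python sketch. It is not the paper's implementation: the paper works with classification trees, while this sketch swaps in a toy nearest-centroid learner to stay self-contained. The names `fit_centroids` and `ReducedModels` are hypothetical, the training data is assumed to be complete, and a missing value at prediction time is represented by `None`. For each pattern of observed features, a model is trained on only those features and cached for reuse.

```python
from collections import defaultdict
from math import dist


def fit_centroids(rows, labels, features):
    """Toy learner (a stand-in for a classification tree).

    Returns the per-class mean of the selected feature columns.
    """
    sums = defaultdict(lambda: [0.0] * len(features))
    counts = defaultdict(int)
    for row, label in zip(rows, labels):
        counts[label] += 1
        for j, f in enumerate(features):
            sums[label][j] += row[f]
    return {c: [s / counts[c] for s in sums[c]] for c in counts}


class ReducedModels:
    """Keeps one model per pattern of observed features (assumed design)."""

    def __init__(self, rows, labels):
        self.rows, self.labels = rows, labels
        self.cache = {}  # tuple of observed feature indices -> fitted model

    def predict(self, instance):
        # Indices of the features that are present in this instance.
        observed = tuple(i for i, v in enumerate(instance) if v is not None)
        # Train a model restricted to those features on first use, then reuse it.
        if observed not in self.cache:
            self.cache[observed] = fit_centroids(self.rows, self.labels, observed)
        model = self.cache[observed]
        point = [instance[i] for i in observed]
        # Predict the class whose centroid is closest in the reduced space.
        return min(model, key=lambda c: dist(point, model[c]))


# Made-up training data: two numeric features, two classes.
rows = [(1.0, 10.0), (2.0, 12.0), (8.0, 11.0), (9.0, 9.0)]
labels = ["a", "a", "b", "b"]

clf = ReducedModels(rows, labels)
pred_missing_second = clf.predict((1.5, None))  # uses a model built on feature 0 only
pred_missing_first = clf.predict((None, 11.0))  # uses a model built on feature 1 only
pred_complete = clf.predict((8.5, 10.0))        # uses the full two-feature model
```

The cache grows with the number of distinct missingness patterns encountered, which is the storage and computation cost the abstract refers to. A hybrid scheme in the spirit of the abstract could cap the cache size and fall back to imputation for uncached patterns; that variant is not shown here.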