A classifier ensemble approach for the missing feature problem

Authors:
Loris Nanni;Alessandra Lumini;Sheryl Brahnam
Affiliations:
Department of Information Engineering, University of Padua, Via Gradenigo, 6/B, 35131 Padova, Italy;DEIS, University of Bologna, Via Venezia 52, 47521 Cesena, Italy;Computer Information Systems, Missouri State University, 901 S. National, Springfield, MO 65804, USA
Venue:
Artificial Intelligence in Medicine
Year:
2012

Abstract

Objectives: Many classification problems must deal with data that contain missing values. In such cases, data imputation is critical. This paper evaluates the performance of several statistical and machine learning imputation methods, including our novel multiple imputation ensemble approach, on different datasets.

Materials and methods: Several state-of-the-art approaches are compared using different datasets. Some state-of-the-art classifiers (including support vector machines and input decimated ensembles) are tested with several imputation methods. The novel approach proposed in this work is a multiple imputation method based on random subspace, where each missing value is calculated considering a different cluster of the data. We use a fuzzy clustering approach as the clustering algorithm.

Results: Our experiments have shown that the proposed multiple imputation approach based on clustering and a random subspace classifier outperforms several other state-of-the-art approaches. Using the Wilcoxon signed-rank test (null hypothesis rejected at a significance level of 0.05), we have shown that the best proposed approach is outperformed by the classifier trained on the original data (i.e., without missing values) only when 20% of the data are missing. Moreover, we have shown that coupling an imputation method with our cluster-based imputation outperforms the base method (level of significance ~0.05).

Conclusion: Starting from the assumptions that the feature set must be partially redundant and that the redundancy is distributed randomly over the feature set, we have proposed a method that works quite well even when a large percentage of the features is missing (=30%). Our best approach is available (MATLAB code) at bias.csr.unibo.it/nanni/MI.rar.
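
The abstract describes the idea only at a high level: each missing value is estimated from a cluster of the data, and the estimate is repeated over random subspaces to obtain multiple imputations. The Python sketch below is one possible reading of that idea, not the authors' MATLAB implementation. It rests on several assumptions that do not come from the paper: hard k-means stands in for the fuzzy clustering, missing entries are coded as `None`, distances use only the observed features of a row, and all function names and parameters (`cluster_impute`, `multiple_imputations`, `subspace_size`, `k`) are hypothetical.

```python
import random


def column_means(rows, default):
    """Mean of each feature over its observed (non-None) entries.

    Falls back to default[j] when feature j has no observed entry.
    """
    means = []
    for j in range(len(default)):
        seen = [r[j] for r in rows if r[j] is not None]
        means.append(sum(seen) / len(seen) if seen else default[j])
    return means


def partial_distance(row, centre, features):
    """Mean squared difference over the chosen features observed in `row`."""
    used = [j for j in features if row[j] is not None]
    if not used:
        return 0.0
    return sum((row[j] - centre[j]) ** 2 for j in used) / len(used)


def cluster_impute(rows, features, k, rng, n_iter=10):
    """Fill each None with the mean of that feature inside the row's cluster.

    Assumption: clusters come from hard k-means restricted to `features`,
    a simplified stand-in for the fuzzy clustering named in the abstract.
    """
    n_features = len(rows[0])
    global_means = column_means(rows, [0.0] * n_features)
    # Start from k random rows, with their gaps filled by global means.
    centres = [[v if v is not None else global_means[j]
                for j, v in enumerate(r)]
               for r in rng.sample(rows, k)]
    labels = [0] * len(rows)
    for _ in range(n_iter):
        # Assign every row to its nearest centre in the chosen subspace.
        labels = [min(range(k),
                      key=lambda c: partial_distance(r, centres[c], features))
                  for r in rows]
        # Recompute each non-empty cluster's centre from observed values.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centres[c] = column_means(members, global_means)
    return [[v if v is not None else centres[lab][j]
             for j, v in enumerate(r)]
            for r, lab in zip(rows, labels)]


def multiple_imputations(rows, n_imputations=5, subspace_size=2, k=2, seed=0):
    """Build several imputed copies, each clustered in a random subspace."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    imputed_sets = []
    for _ in range(n_imputations):
        features = rng.sample(range(n_features), subspace_size)
        imputed_sets.append(cluster_impute(rows, features, k, rng))
    return imputed_sets


# Toy data (invented for illustration): two loose groups with gaps.
rows = [
    [1.0, 2.0, None],
    [1.2, None, 0.5],
    [8.0, 9.0, 7.5],
    [None, 9.5, 8.0],
    [7.8, 9.1, None],
]
imputed_sets = multiple_imputations(rows)
```

In an ensemble setting, each element of `imputed_sets` would train its own classifier and the predictions would be combined, for instance by majority vote. That step is left out here because the abstract does not specify the combination rule.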