Missing data imputation using statistical and machine learning methods in a real breast cancer problem

Authors:
José M. Jerez;Ignacio Molina;Pedro J. García-Laencina;Emilio Alba;Nuria Ribelles;Miguel Martín;Leonardo Franco
Affiliations:
Departamento de Lenguajes y Ciencias de la Computación, Universidad de Málaga, E.T.S.I. Informática, Campus de Teatinos s/n, 29071 Málaga, Spain;Departamento de Tecnología Electrónica, Universidad de Málaga, Campus de Teatinos s/n, 29071 Málaga, Spain;Departamento de Tecnologías de la Información y las Comunicaciones, Universidad Politécnica de Cartagena, Plaza del Hospital 1, 30202 Cartagena (Murcia), Spain;Servicio de Oncología Médica, Hospital Clínico Universitario Virgen de la Victoria, Campus de Teatinos s/n, 29010 Málaga, Spain;Servicio de Oncología Médica, Hospital Clínico Universitario Virgen de la Victoria, Campus de Teatinos s/n, 29010 Málaga, Spain;Servicio de Oncología Médica, Hospital Clínico San Carlos, Profesor Martín Lagos s/n, 28040 Madrid, Spain;Departamento de Lenguajes y Ciencias de la Computación, Universidad de Málaga, E.T.S.I. Informática, Campus de Teatinos s/n, 29071 Málaga, Spain
Venue:
Artificial Intelligence in Medicine
Year:
2010

Abstract

Objectives: Missing data imputation is an important task in cases where it is crucial to use all available data and not discard records with missing values. This work evaluates the performance of several statistical and machine learning imputation methods that were used to predict recurrence in patients in an extensive real breast cancer data set.

Materials and methods: Imputation methods based on statistical techniques (mean, hot-deck and multiple imputation) and on machine learning techniques (multi-layer perceptron (MLP), self-organising maps (SOM) and k-nearest neighbour (KNN)) were applied to data collected through the "El Alamo-I" project, and the results were compared with those obtained using listwise deletion (LD). The database includes demographic, therapeutic and recurrence-survival information from 3679 women with operable invasive breast cancer diagnosed in 32 different hospitals belonging to the Spanish Breast Cancer Research Group (GEICAM). Prediction accuracy for early cancer relapse was measured using artificial neural networks (ANNs), with a separate ANN estimated on each data set with imputed missing values.

Results: The imputation methods based on machine learning algorithms outperformed the statistical imputation methods in the prediction of patient outcome. Friedman's test revealed a significant difference (p=0.0091) in the observed area under the ROC curve (AUC) values, and the pairwise comparison test showed that the AUCs for MLP, KNN and SOM were significantly higher (p=0.0053, p=0.0048 and p=0.0071, respectively) than the AUC from the LD-based prognosis model.

Conclusion: The methods based on machine learning techniques were the most suited for the imputation of missing values and led to a significant enhancement of prognosis accuracy compared to imputation methods based on statistical procedures.
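
To make the contrast between the compared strategies concrete, the following is a minimal Python sketch of listwise deletion, mean imputation and a basic k-nearest-neighbour imputation on a toy numeric table. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy `records` values, the distance (Euclidean over the features observed in both records) and the choice of k are all hypothetical, and the abstract does not specify which KNN variant was used.

```python
import math

def listwise_deletion(rows):
    """Keep only records that have no missing (None) values."""
    return [r for r in rows if all(v is not None for v in r)]

def mean_impute(rows):
    """Replace each missing value with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]

def _distance(a, b):
    """Root-mean-square difference over the features observed in both records.

    This is an assumed distance for the sketch; records that share no
    observed feature are treated as infinitely far apart.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

def knn_impute(rows, k=2):
    """Fill each missing value with the mean of that feature over the k nearest donors.

    A donor is any other record in which the feature is observed.
    """
    imputed = []
    for i, r in enumerate(rows):
        new_row = list(r)
        for j, v in enumerate(r):
            if v is not None:
                continue
            donors = [
                (_distance(r, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            donors.sort(key=lambda d: d[0])
            nearest = [value for _, value in donors[:k]]
            new_row[j] = sum(nearest) / len(nearest)
        imputed.append(new_row)
    return imputed

# Hypothetical toy data: two loose clusters, with None marking a missing value.
records = [
    [1.0, 2.0, 3.0],
    [1.1, None, 3.1],
    [5.0, 6.0, 7.0],
    [5.2, 6.2, None],
    [1.2, 2.2, 3.2],
    [4.8, 5.8, 6.8],
]

complete_cases = listwise_deletion(records)
mean_filled = mean_impute(records)
knn_filled = knn_impute(records, k=2)
```

In this toy setting, listwise deletion discards the two incomplete records, mean imputation fills each gap with a single column-wide average regardless of which cluster the record belongs to, and the KNN sketch fills each gap from the most similar records. That difference in how much local structure is used is one plausible reading of why the neighbour-based and other machine learning methods behaved differently from the simpler statistical ones in the study.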