A data pre-processing method to increase efficiency and accuracy in data mining

Authors:
Amir R. Razavi;Hans Gill;Hans Åhlfeldt;Nosrat Shahsavar
Affiliations:
Department of Biomedical Engineering, Division of Medical Informatics, Linköping University, Sweden;Department of Biomedical Engineering, Division of Medical Informatics, Linköping University, Sweden;Department of Biomedical Engineering, Division of Medical Informatics, Linköping University, Sweden;Department of Biomedical Engineering, Division of Medical Informatics, Linköping University, Sweden
Venue:
AIME'05 Proceedings of the 10th conference on Artificial Intelligence in Medicine
Year:
2005


Abstract

In medicine, data mining methods such as Decision Tree Induction (DTI) can be trained to extract rules that predict the outcomes of new patients. However, the incompleteness and high dimensionality of stored data pose problems. Canonical Correlation Analysis (CCA) can be applied before DTI as a dimension reduction technique that preserves the character of the original data while omitting non-essential data. In this study, data from 3949 breast cancer patients were analysed. Raw data were cleaned by applying a set of logical rules, and missing values were replaced using the Expectation Maximization algorithm. After dimension reduction with CCA, DTI was applied to the resulting dataset. The validity of the predictive model was confirmed by ten-fold cross-validation, and the effect of pre-processing was assessed by also applying DTI to the data without pre-processing. Replacing missing values and using CCA for dimension reduction dramatically reduced the size of the resulting tree and increased the accuracy of predicting breast cancer recurrence.
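
The validation step mentioned in the abstract, ten-fold cross-validation of a decision tree, can be illustrated with a minimal sketch. This is not the authors' implementation: the tree is reduced to a depth-1 decision stump, the data are synthetic rather than patient records, and every function name (`k_fold_indices`, `fit_stump`, `cross_validate`, `make_synthetic`) is hypothetical. The cleaning, imputation and CCA steps are not shown.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_stump(X, y):
    """Fit a depth-1 decision tree (assumed stand-in for full DTI).

    Tries every (feature, threshold, label-for-left-branch) combination
    and keeps the one with the fewest training errors.
    """
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for left_label in (0, 1):
                errors = sum(
                    (left_label if row[f] <= t else 1 - left_label) != label
                    for row, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, t, left_label)
    _, f, t, left_label = best
    return f, t, left_label

def predict(stump, row):
    """Classify one row with a fitted stump."""
    f, t, left_label = stump
    return left_label if row[f] <= t else 1 - left_label

def cross_validate(X, y, k=10, seed=0):
    """Return the mean held-out accuracy over k folds."""
    folds = k_fold_indices(len(X), k, seed)
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        stump = fit_stump([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(stump, X[i]) == y[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k

def make_synthetic(n=200, seed=1):
    """Synthetic two-feature data; the label depends only on feature 0."""
    rng = random.Random(seed)
    X = [[rng.random(), rng.random()] for _ in range(n)]
    y = [1 if row[0] > 0.5 else 0 for row in X]
    return X, y

X, y = make_synthetic()
mean_accuracy = cross_validate(X, y)
print(f"mean 10-fold accuracy: {mean_accuracy:.3f}")
```

Each record is held out exactly once, so the averaged accuracy estimates performance on unseen patients. Running the same procedure on data with and without pre-processing would give the kind of comparison the abstract describes.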