A data pre-processing method to increase efficiency and accuracy in data mining

  • Authors:
  • Amir R. Razavi; Hans Gill; Hans Åhlfeldt; Nosrat Shahsavar

  • Affiliations:
  • Department of Biomedical Engineering, Division of Medical Informatics, Linköping University, Sweden (all authors)

  • Venue:
  • AIME'05 Proceedings of the 10th conference on Artificial Intelligence in Medicine
  • Year:
  • 2005

Abstract

In medicine, data mining methods such as Decision Tree Induction (DTI) can be trained to extract rules that predict the outcomes of new patients. However, the incompleteness and high dimensionality of stored data pose problems. Canonical Correlation Analysis (CCA) can be used prior to DTI as a dimension reduction technique that preserves the character of the original data while omitting non-essential data. In this study, data from 3949 breast cancer patients were analysed. Raw data were cleaned by running a set of logical rules. Missing values were replaced using the Expectation Maximization algorithm. After dimension reduction with CCA, DTI was employed to analyse the resulting dataset. The validity of the predictive model was confirmed by ten-fold cross-validation, and the effect of pre-processing was assessed by also applying DTI to data without pre-processing. Replacing missing values and using CCA for data reduction dramatically reduced the size of the resulting tree and increased the accuracy of the prediction of breast cancer recurrence.
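
The ten-fold cross-validation step mentioned in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the study validated a decision tree, whereas the sketch swaps in a placeholder majority-class predictor so that it runs with the Python standard library alone. The function names (`k_fold_indices`, `majority_class`, `cross_validate`) and the toy labels are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]


def majority_class(labels):
    """Placeholder 'model' (an assumption, not DTI): predict the most common label."""
    return Counter(labels).most_common(1)[0][0]


def cross_validate(labels, k=10, seed=0):
    """Return the mean held-out accuracy of the placeholder model over k folds."""
    folds = k_fold_indices(len(labels), k, seed)
    accuracies = []
    for held_out in folds:
        held_out_set = set(held_out)
        # Train on every sample that is not in the held-out fold.
        train_labels = [y for i, y in enumerate(labels) if i not in held_out_set]
        prediction = majority_class(train_labels)
        correct = sum(1 for i in held_out if labels[i] == prediction)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


# Hypothetical toy labels (1 = recurrence, 0 = no recurrence); not data from the study.
toy_labels = [0] * 70 + [1] * 30
mean_accuracy = cross_validate(toy_labels, k=10)
```

Each sample is held out exactly once, so the averaged accuracy reflects performance on data the model did not see during training. To compare pre-processed and raw data in the way the abstract describes, the same folds would be reused for both variants, with a real classifier in place of the placeholder.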