Medical data mining by fuzzy modeling with selected features

  • Authors: Sean N. Ghazavi; Thunshun W. Liao

  • Affiliations: Industrial Engineering Department, 3128 Patrick F. Taylor Hall, Louisiana State University, Baton Rouge, LA 70803, United States (both authors)

  • Venue: Artificial Intelligence in Medicine
  • Year: 2008

Abstract

Objective: Medical data is often very high-dimensional, and depending on the use, some dimensions may be more relevant than others. In processing medical data, choosing the optimal subset of features is important, not only to reduce the processing cost but also to improve the usefulness of the model built from the selected data. This paper presents a data mining study of medical data with fuzzy modeling methods that use feature subsets selected by several indices/methods.

Methods: Three fuzzy modeling methods are employed: the fuzzy k-nearest neighbor algorithm, a fuzzy clustering-based modeling method, and the adaptive network-based fuzzy inference system. For feature selection, a total of 11 indices/methods are used. The medical data mined include the Wisconsin breast cancer dataset and the Pima Indians diabetes dataset. The classification accuracy and computational time are reported. To show how good the best performer is, the global optimum was also found by exhaustively testing all possible feature subsets with three features.

Results: For the Wisconsin breast cancer dataset, the best accuracy obtained was 97.17%, which is only 0.25% lower than that obtained by exhaustive testing. For the Pima Indians diabetes dataset, the best accuracy obtained was 77.65%, which is only 0.13% lower than that obtained by exhaustive testing.

Conclusion: This paper has shown that feature selection is important in mining medical data, both for reducing processing time and for increasing classification accuracy. However, not all combinations of feature selection and modeling methods are equally effective, and the best combination is often data-dependent, as supported by the breast cancer and diabetes data analyzed in this paper.
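
To make the exhaustive baseline concrete, the following is a minimal Python sketch of scoring every three-feature subset with a fuzzy k-nearest neighbor classifier. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the evaluation protocol, the neighbor count, or the fuzzifier, so the single train/test split, `k=5`, `m=2.0`, the crisp training labels, the synthetic data, and all function names (`fuzzy_knn_predict`, `exhaustive_subset_search`, `make_toy_data`) are hypothetical choices made for this example.

```python
from itertools import combinations

import numpy as np


def fuzzy_knn_predict(X_train, y_train, X_test, k=5, m=2.0):
    """Fuzzy k-NN with crisp training labels (assumed variant).

    Class memberships are inverse-distance-weighted votes of the k nearest
    training points; m > 1 is the fuzzifier controlling the weighting.
    """
    classes = np.unique(y_train)
    # Pairwise Euclidean distances, shape (n_test, n_train).
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, :k]            # indices of the k nearest
    nn_d = np.take_along_axis(d, nn, axis=1)     # their distances
    # Guard against zero distance before taking the inverse power.
    w = 1.0 / np.maximum(nn_d, 1e-12) ** (2.0 / (m - 1.0))
    nn_labels = y_train[nn]                      # shape (n_test, k)
    memberships = np.stack(
        [(w * (nn_labels == c)).sum(axis=1) for c in classes], axis=1
    )
    memberships /= w.sum(axis=1, keepdims=True)  # each row sums to 1
    return classes[np.argmax(memberships, axis=1)], memberships


def exhaustive_subset_search(X_train, y_train, X_test, y_test,
                             subset_size=3, k=5):
    """Score every feature subset of the given size; return the best one."""
    best_acc, best_subset = -1.0, None
    for subset in combinations(range(X_train.shape[1]), subset_size):
        cols = list(subset)
        pred, _ = fuzzy_knn_predict(
            X_train[:, cols], y_train, X_test[:, cols], k=k
        )
        acc = float(np.mean(pred == y_test))
        if acc > best_acc:
            best_acc, best_subset = acc, subset
    return best_subset, best_acc


def make_toy_data(n=200, n_features=6, seed=0):
    """Synthetic two-class data (not medical data): only features 0-2 carry signal."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, n_features))
    X[:, :3] += 3.0 * y[:, None]
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    subset, acc = exhaustive_subset_search(X[:140], y[:140], X[140:], y[140:])
    print("best 3-feature subset:", subset, "accuracy:", acc)
```

The number of candidate subsets grows combinatorially with the number of features, so an exhaustive search of this kind is practical only for small feature counts and serves mainly as a reference point for judging the cheaper selection indices.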