A hybrid prediction model with F-score feature selection for type II Diabetes databases

  • Authors:
  • B. Sarojini Ilango; N. Ramaraj

  • Affiliations:
  • K.L.N. College of Information Technology, Madurai; G.K.M. College of Engineering & Technology, Chennai

  • Venue:
  • Proceedings of the 1st Amrita ACM-W Celebration on Women in Computing in India
  • Year:
  • 2010

Abstract

Medical data are multidimensional and are represented by a large number of features. Hundreds of independent features (parameters) in these high-dimensional databases must be considered and analyzed simultaneously to extract information that is valuable for decision-making in medical prediction. Most data mining methods depend on a set of features that define the behavior of the learning algorithm and that directly or indirectly influence the complexity of the resulting models. Hence, to improve the efficiency and accuracy of mining tasks on high-dimensional data, the data must first be preprocessed with an efficient dimensionality reduction method. The aim of this study is to improve the accuracy of diabetes diagnosis by selecting the informative features of the Pima Indians Diabetes dataset. The study proposes a Hybrid Prediction Model with an F-score feature selection approach to identify the optimal feature subset of this dataset. The features are ranked by F-score, and the feature subset that gives the minimal clustering error is taken as the optimal feature subset. The correctly classified instances determine the pattern for diagnosis and are used in the subsequent classification step. The improved performance of the Support Vector Machine classifier, measured in terms of accuracy, sensitivity, specificity, and Area Under Curve (AUC), shows that the proposed feature selection approach improves classification performance. The proposed prediction model achieves a predictive accuracy of 98.9427%, which is the highest predictive accuracy reported for the diabetes dataset among the models in the literature for this problem.
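The F-score ranking step can be illustrated with a minimal sketch. The abstract does not state the formula, so this assumes the F-score commonly used for feature selection on two-class data: the squared distances of the two class means from the overall mean, divided by the sum of the two within-class sample variances. The function names `f_score` and `rank_features` and the toy data are hypothetical, and the data are not taken from the Pima Indians Diabetes dataset.

```python
from statistics import mean, variance


def f_score(values, labels):
    """F-score of one feature for binary labels (1 = positive, 0 = negative).

    Assumed definition: between-class separation divided by the sum of
    the within-class sample variances.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    overall = mean(values)
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    within = variance(pos) + variance(neg)  # sample variances (n - 1)
    return between / within


def rank_features(rows, labels):
    """Return feature indices sorted by descending F-score, plus the scores."""
    columns = list(zip(*rows))  # one tuple of values per feature
    scores = [f_score(col, labels) for col in columns]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Hypothetical toy data: 6 samples, 3 features.
rows = [
    [1.0, 5.0, 2.0],
    [1.2, 3.0, 2.5],
    [0.8, 4.0, 3.0],
    [3.0, 4.5, 3.5],
    [3.2, 3.5, 3.0],
    [2.8, 4.0, 4.0],
]
labels = [0, 0, 0, 1, 1, 1]

order, scores = rank_features(rows, labels)
print(order)  # [0, 2, 1]: feature 0 separates the classes best
```

A higher score means the feature separates the two classes better. The sketch does not guard against a feature that is constant within both classes (zero denominator), and it covers only the ranking, not the clustering-based subset selection or the SVM classification described in the abstract.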