Supervised learning approaches and feature selection - a case study in diabetes

  • Authors:
  • Yugowati Praharsi; Shaou-Gang Miaou; Hui-Ming Wee

  • Affiliations:
  • Department of Industrial and System Engineering, Chung Yuan Christian University, Chung Li, 32023, Taiwan / Department of Information Technology, Satya Wacana Christian University, Salatiga, 50711, ...; Department of Electronic Engineering, Chung Yuan Christian University, Chung Li, 32023, Taiwan; Department of Industrial and System Engineering, Chung Yuan Christian University, No. 200, Chung Pei Rd., Chungli, 32023, Taiwan

  • Venue:
  • International Journal of Data Analysis Techniques and Strategies
  • Year:
  • 2013

Abstract

Data description and classification are important tasks in supervised learning. In this study, three supervised learning methods are considered, namely k-nearest neighbour (k-NN), support vector data description (SVDD) and support vector machine (SVM), because they do not suffer from the problem of introducing a new class. The data set chosen is the Pima Indians diabetes data set. The results show that feature selection based on mean information gain and a standard deviation threshold can be considered as a substitute for forward selection. This indicates that data variation, as captured by information gain, is an important factor that must be considered when selecting a feature subset. Finally, among eight candidate features, glucose level is the most prominent feature for diabetes detection in all classifiers and feature selection methods under consideration. The relevancy measure in information gain can rank features from the most important to the least significant. This ranking can be very useful in medical applications such as defining feature prioritisation for symptom recognition.
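
The information-gain thresholding mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact rule, so the cut-off `mean + k * std` over the per-feature gains, the equal-width binning, the function names and the toy data (which is not the Pima Indians data) are all hypothetical.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels, bins=2):
    """Information gain of one numeric feature after equal-width binning (assumed scheme)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant feature
    binned = [min(int((v - lo) / width), bins - 1) for v in values]
    conditional = 0.0
    for b in set(binned):
        subset = [y for x, y in zip(binned, labels) if x == b]
        conditional += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - conditional


def select_features(columns, labels, k=0.0):
    """Keep features whose gain is at least mean(gain) + k * std(gain) (assumed rule)."""
    gains = {name: information_gain(vals, labels) for name, vals in columns.items()}
    gain_values = list(gains.values())
    threshold = mean(gain_values) + k * pstdev(gain_values)
    ranked = sorted(gains, key=gains.get, reverse=True)  # most to least relevant
    return [name for name in ranked if gains[name] >= threshold], gains


# Toy data for illustration only: one informative feature and two uninformative ones.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "f_signal": [85, 90, 95, 100, 150, 160, 170, 180],
    "f_noise_a": [1, 2, 1, 2, 1, 2, 1, 2],
    "f_noise_b": [5, 5, 6, 6, 5, 5, 6, 6],
}
selected, gains = select_features(columns, labels)
print(selected, gains)
```

Raising `k` makes the filter stricter. Unlike forward selection, a filter of this kind ranks the features once and does not retrain a classifier for each candidate subset.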