Improving supervised learning performance by using fuzzy clustering method to select training data

  • Authors:
  • Donghai Guan;Weiwei Yuan;Young-Koo Lee;Andrey Gavrilov;Sungyoung Lee

  • Affiliations:
  • Department of Computer Engineering, Kyung Hee University, Korea (all authors). Corresponding author: Young-Koo Lee (Tel.: +82 31 201 3732; E-mail: yklee@khu.ac.kr)

  • Venue:
  • Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology - Fuzzy theory and technology with applications
  • Year:
  • 2008

Abstract

A crucial issue in many classification applications is how to obtain the best possible classifier from a limited amount of labeled training data. Training data selection addresses this issue by selecting the most informative data for training. In this work, we propose three data selection mechanisms based on fuzzy clustering: center-based selection, border-based selection, and hybrid selection. Center-based selection selects the samples with a high degree of membership in each cluster as training data. Border-based selection selects the samples around the border between clusters. Hybrid selection combines center-based and border-based selection. Compared with existing work, our methods do not require much computational effort. Moreover, they are independent of the supervised learning algorithm and of the initial labeled data. We use fuzzy c-means to implement our data selection mechanisms and study their effects empirically on a set of UCI data sets. Experimental results indicate that, compared with random selection, hybrid selection effectively enhances learning performance on all the data sets, center-based selection performs better on certain data sets, and border-based selection does not show significant improvement.
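
To make the three mechanisms concrete, the following is a minimal sketch in Python with NumPy, not the authors' implementation. The abstract does not give the exact selection criteria, so several details are assumptions made for illustration: the fuzzifier m = 2, "high degree of membership" read as the top k memberships per cluster, "around the border" read as the smallest gap between a sample's two largest memberships, and "hybrid" read as the union of the two selections. All function and parameter names are hypothetical.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Standard fuzzy c-means. Returns (centers, U), U has shape (n_samples, c)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)            # each row of memberships sums to 1
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Euclidean distance from every sample to every cluster center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                    # guard against division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U

def center_based(U, k):
    """Assumed criterion: the k samples with the highest membership in each cluster."""
    picks = [np.argsort(-U[:, j])[:k] for j in range(U.shape[1])]
    return np.unique(np.concatenate(picks))

def border_based(U, k):
    """Assumed criterion: the k samples whose two largest memberships are closest."""
    top2 = np.sort(U, axis=1)[:, -2:]
    gap = top2[:, 1] - top2[:, 0]
    return np.argsort(gap)[:k]

def hybrid(U, k_center, k_border):
    """Assumed combination: union of the center-based and border-based picks."""
    return np.union1d(center_based(U, k_center), border_based(U, k_border))

# Toy data (not from the paper): two Gaussian blobs in 2-D.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
centers, U = fuzzy_c_means(X, c=2)
selected = hybrid(U, k_center=5, k_border=5)     # indices of samples to label
```

The indices in `selected` would then be labeled and passed to any supervised learner. The selection step uses only the membership matrix `U`, which illustrates why the approach does not depend on the learning algorithm or on the initial labeled data.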