An efficient feature selection approach for clustering: using a Gaussian mixture model of data dissimilarity

  • Authors:
  • Chieh-Yuan Tsai; Chuang-Cheng Chiu

  • Affiliations:
  • Industrial Engineering and Management Department, Yuan-Ze University, Taiwan, R.O.C.; Industrial Engineering and Management Department, Yuan-Ze University, Taiwan, R.O.C.

  • Venue:
  • ICCSA'07 Proceedings of the 2007 international conference on Computational science and its applications - Volume Part I
  • Year:
  • 2007

Abstract

Rapid advances in computer and database technologies have enabled organizations to accumulate vast amounts of data. The sheer volume of such data complicates data analysis. Feature selection is an effective dimensionality reduction technique that removes irrelevant, redundant, or noisy features. This research proposes a novel feature selection measure for evaluating the importance of features in clustering. The measure extracts information from the dissimilarity between pairs of data objects, since dissimilarity is a common criterion for deciding whether two objects belong to the same cluster. The probability distribution of the dissimilarity variable is modeled as a mixture of two Gaussian distributions, one for "intra-cluster" dissimilarity and one for "inter-cluster" dissimilarity. The means of the two Gaussians are estimated with the EM algorithm, and the difference between the two means serves as the measure for selecting features that are important for clustering. A set of experiments demonstrates the effectiveness of the proposed measure.