Learning the number of Gaussians using hypothesis test

  • Authors:
  • Gyeongyong Heo; Paul Gader

  • Affiliations:
  • Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL (both authors)

  • Venue:
  • IJCNN'09: Proceedings of the 2009 International Joint Conference on Neural Networks
  • Year:
  • 2009

Abstract

This paper addresses the problem of estimating the correct number of components in a Gaussian mixture from a sample data set. In particular, an extension of the Gaussian-means (G-means) and Projected Gaussian-means (PG-means) algorithms is proposed. All of these methods are based on one-dimensional statistical hypothesis tests. G-means and PG-means are wrapper algorithms around the k-means and Expectation-Maximization (EM) algorithms, respectively. Although G-means is simple and fast, it does not perform well when clusters overlap, since it is based on k-means. PG-means can handle overlapping clusters but requires more computation and sometimes fails to find the right number of clusters. In this paper, we propose an extension, called Extended Projected Gaussian-means (XPG-means), which is a wrapper algorithm around the Possibilistic Fuzzy C-means (PFCM) algorithm. XPG-means integrates the advantages of both algorithms while resolving some of their disadvantages involving overlapping clusters, noise, and computational complexity. More specifically, XPG-means handles overlapping clusters better than G-means because it uses fuzzy clustering, and it handles noise better than both algorithms because it uses possibilistic clustering. XPG-means is also less computationally expensive than PG-means because it uses the local hypothesis testing scheme of G-means, which is specific to Gaussians, whereas PG-means uses a more general Kolmogorov-Smirnov test on the full Gaussian mixture. In addition, XPG-means demonstrates less variance in estimating the number of components than either of the other algorithms.
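The local, one-dimensional hypothesis test underlying G-means (and reused by XPG-means) can be illustrated with a short sketch: project a cluster's points onto their principal axis and test the 1-D projection for normality with an Anderson-Darling test. This is an illustrative assumption of the general scheme, not the paper's exact implementation; the function name `looks_gaussian` and the choice of significance level are hypothetical.

```python
import numpy as np
from scipy.stats import anderson

def looks_gaussian(points, critical_index=2):
    """Sketch of a G-means-style local test: project the cluster's
    points onto their principal direction and run an Anderson-Darling
    normality test on the 1-D projection. Returns True when the
    Gaussian hypothesis is NOT rejected (critical_index=2 corresponds
    to scipy's 5% significance level)."""
    points = np.asarray(points, dtype=float)
    centered = points - points.mean(axis=0)
    # principal direction = top right-singular vector of centered data
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projection = centered @ vt[0]
    result = anderson(projection, dist="norm")
    # accept the single-Gaussian hypothesis if the A^2 statistic
    # falls below the chosen critical value
    return bool(result.statistic < result.critical_values[critical_index])

rng = np.random.default_rng(0)
# a single Gaussian cluster vs. two well-separated clusters
one_cluster = rng.normal(size=(500, 2))
two_clusters = np.vstack([rng.normal(-4.0, 1.0, (250, 2)),
                          rng.normal(4.0, 1.0, (250, 2))])
print(looks_gaussian(one_cluster))
print(looks_gaussian(two_clusters))  # bimodal projection: rejected
```

A wrapper algorithm in this family would split any cluster whose test rejects the Gaussian hypothesis and re-run the underlying clustering, repeating until every cluster passes.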