Assessment and pruning of hierarchical model based clustering

  • Authors:
  • Jeremy Tantrum; Alejandro Murua; Werner Stuetzle

  • Affiliations:
  • University of Washington, Seattle, WA; University of Washington, Seattle, WA; University of Washington, Seattle, WA

  • Venue:
  • Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
  • Year:
  • 2003

Abstract

The goal of clustering is to identify distinct groups in a dataset. The basic idea of model-based clustering is to approximate the data density by a mixture model, typically a mixture of Gaussians, and to estimate the parameters of the component densities, the mixing fractions, and the number of components from the data. The number of distinct groups in the data is then taken to be the number of mixture components, and the observations are partitioned into clusters (estimates of the groups) using Bayes' rule. If the groups are well separated and look Gaussian, then the resulting clusters will indeed tend to be "distinct" in the most common sense of the word: contiguous, densely populated areas of feature space, separated by contiguous, relatively empty regions. If the groups are not Gaussian, however, this correspondence may break down; an isolated group with a non-elliptical distribution, for example, may be modeled by not one, but several mixture components, and the corresponding clusters will no longer be well separated. We present methods for assessing the degree of separation between the components of a mixture model and between the corresponding clusters. We also propose a new clustering method that can be regarded as a hybrid between model-based and nonparametric clustering. The hybrid clustering algorithm prunes the cluster tree generated by hierarchical model-based clustering. Starting with the tree corresponding to the mixture model chosen by the Bayesian Information Criterion, it progressively merges clusters that do not appear to correspond to different modes of the data density.
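
The two ideas in the abstract, assigning observations to components by Bayes' rule and asking whether two components correspond to different modes of the density, can be illustrated with a small one-dimensional sketch. This is not the authors' procedure, which the abstract does not specify; it assumes the mixture parameters are already known, and the function names (`bayes_assign`, `dip_ratio`) and the valley-depth heuristic are hypothetical choices made for illustration only.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, weights, means, sds):
    """Density of the Gaussian mixture at x."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds))


def bayes_assign(x, weights, means, sds):
    """Index of the component with the largest posterior probability at x.

    The posterior is proportional to weight * component density, so the
    shared normalizing constant can be ignored when taking the argmax.
    """
    scores = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    return max(range(len(scores)), key=scores.__getitem__)


def dip_ratio(i, j, weights, means, sds, steps=200):
    """Hypothetical separation heuristic (an assumption, not from the paper).

    Evaluates the mixture density on a grid between the means of components
    i and j and returns the lowest density on that segment divided by the
    smaller of the two endpoint densities. A value near 1 means there is no
    valley between the components; a value near 0 means a deep valley.
    """
    a, b = means[i], means[j]
    grid = [a + (b - a) * t / steps for t in range(steps + 1)]
    dens = [mixture_pdf(x, weights, means, sds) for x in grid]
    return min(dens) / min(dens[0], dens[-1])


# Two overlapping components: the mixture has a single mode.
overlapping = dip_ratio(0, 1, [0.5, 0.5], [0.0, 1.0], [1.0, 1.0])

# Two well-separated components: a deep valley lies between the means.
separated = dip_ratio(0, 1, [0.5, 0.5], [0.0, 6.0], [1.0, 1.0])
```

Under this heuristic, a pair of components with a ratio close to 1 would be a candidate for merging, since no low-density region separates them, while a ratio close to 0 suggests two distinct modes. Any threshold on the ratio would be a further assumption.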