Model selection for probabilistic clustering using cross-validated likelihood

  • Authors:
  • Padhraic Smyth

  • Affiliations:
  • Information and Computer Science, University of California, Irvine, CA 92697-3425 (Also with the Jet Propulsion Laboratory 126-347, California Institute of Technology, Pasadena, CA 91109)

  • Venue:
  • Statistics and Computing
  • Year:
  • 2000

Abstract

Cross-validated likelihood is investigated as a tool for automatically determining the appropriate number of components (given the data) in finite mixture modeling, particularly in the context of model-based probabilistic clustering. The conceptual framework for the cross-validation approach to model selection is straightforward in the sense that models are judged directly on their estimated out-of-sample predictive performance. The cross-validation approach, as well as penalized likelihood and McLachlan's bootstrap method, are applied to two data sets and the results from all three methods are in close agreement. The second data set involves a well-known clustering problem from the atmospheric science literature using historical records of upper atmosphere geopotential height in the Northern hemisphere. Cross-validated likelihood provides an interpretable and objective solution to the atmospheric clustering problem. The clusters found are in agreement with prior analyses of the same data based on non-probabilistic clustering techniques.
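
The idea of judging each candidate number of components by its out-of-sample log-likelihood can be illustrated with a minimal sketch. It is not the paper's implementation: the 1-D Gaussian mixture, the plain EM fit, the repeated random half/half splits, and all names (`fit_gmm_1d`, `cv_log_likelihood`, `select_k`) and parameter values are assumptions made for illustration, since the abstract does not specify the partitioning scheme or the fitting details.

```python
import math
import random


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_normal_pdf(x, mean, var):
    """Log density of a 1-D Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def fit_gmm_1d(data, k, n_iter=50, min_var=1e-3):
    """Fit a k-component 1-D Gaussian mixture with plain EM (illustrative)."""
    srt = sorted(data)
    n = len(srt)
    # Deterministic start: means at evenly spaced quantiles, shared variance.
    means = [srt[int((j + 0.5) * n / k)] for j in range(k)]
    centre = sum(srt) / n
    var0 = max(sum((x - centre) ** 2 for x in srt) / n, min_var)
    variances = [var0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for x in data:
            logs = [math.log(weights[j]) + log_normal_pdf(x, means[j], variances[j])
                    for j in range(k)]
            norm = log_sum_exp(logs)
            resp.append([math.exp(v - norm) for v in logs])
        # M-step: re-estimate weights, means and variances.
        counts = [sum(r[j] for r in resp) + 1e-12 for j in range(k)]
        total = sum(counts)
        for j in range(k):
            weights[j] = counts[j] / total
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / counts[j]
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data))
            variances[j] = max(spread / counts[j], min_var)  # floor avoids collapse
    return weights, means, variances


def mean_log_likelihood(data, params):
    """Average log-likelihood per point of `data` under a fitted mixture."""
    weights, means, variances = params
    total = 0.0
    for x in data:
        total += log_sum_exp([math.log(w) + log_normal_pdf(x, m, v)
                              for w, m, v in zip(weights, means, variances)])
    return total / len(data)


def cv_log_likelihood(data, k, n_splits=5, test_fraction=0.5, seed=0):
    """Average held-out log-likelihood over repeated random train/test splits."""
    rng = random.Random(seed)
    n_test = int(len(data) * test_fraction)
    scores = []
    for _ in range(n_splits):
        shuffled = list(data)
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        scores.append(mean_log_likelihood(test, fit_gmm_1d(train, k)))
    return sum(scores) / len(scores)


def select_k(data, k_values, **cv_options):
    """Return the k with the highest cross-validated log-likelihood, plus all scores."""
    scores = {k: cv_log_likelihood(data, k, **cv_options) for k in k_values}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Synthetic data (an assumption for the demo): two well-separated clusters.
    rng = random.Random(1)
    data = ([rng.gauss(-4.0, 1.0) for _ in range(150)]
            + [rng.gauss(4.0, 1.0) for _ in range(150)])
    best_k, scores = select_k(data, [1, 2, 3, 4])
    print(best_k, scores)
```

Because every candidate is scored on data it was not fitted to, adding components beyond what the data support tends not to improve the score, which is what makes the held-out likelihood usable as a selection criterion without an explicit penalty term.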