Model selection for probabilistic clustering using cross-validated likelihood

Authors:
Padhraic Smyth
Affiliations:
Information and Computer Science, University of California, Irvine, CA 92697-3425 (Also with the Jet Propulsion Laboratory 126-347, California Institute of Technology, Pasadena, CA 91109)
Venue:
Statistics and Computing
Year:
2000

Citing 4
Cited 23

Elements of information theory

Elements of information theory
Efficient Approximations for the Marginal Likelihood of Bayesian Networks with Hidden Variables

Machine Learning - Special issue on learning with probabilistic representations
Linearly Combining Density Estimators via Stacking

Machine Learning
Mixfit: An Algorithm for the Automatic Fitting and Testing of Normal Mixture Models

ICPR '98 Proceedings of the 14th International Conference on Pattern Recognition - Volume 1

Unsupervised Learning of Finite Mixture Models

IEEE Transactions on Pattern Analysis and Machine Intelligence
Assessing a Mixture Model for Clustering with the Integrated Completed Likelihood

IEEE Transactions on Pattern Analysis and Machine Intelligence
Fitting of mixtures with unspecified number of components using cross validation distance estimate

Computational Statistics & Data Analysis
Data mining tasks and methods: Clustering: numerical clustering

Handbook of data mining and knowledge discovery
References

Sourcebook of parallel computing
Translation-invariant mixture models for curve clustering

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Clustering Aggregation

ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Beyond Tracking: Modelling Activity and Understanding Behaviour

International Journal of Computer Vision
How Many Clusters? An Information-Theoretic Perspective

Neural Computation
Clustering aggregation

ACM Transactions on Knowledge Discovery from Data (TKDD)
Newtonian clustering: An approach based on molecular dynamics and global optimization

Pattern Recognition
A quick procedure for model selection in the case of mixture of normal densities

Computational Statistics & Data Analysis
Top-Down Parameter-Free Clustering of High-Dimensional Categorical Data

IEEE Transactions on Knowledge and Data Engineering
High-Dimensional Unsupervised Selection and Estimation of a Finite Generalized Dirichlet Mixture Model Based on Minimum Message Length

IEEE Transactions on Pattern Analysis and Machine Intelligence
Complexity control in a mixture model by the Hardy-Weinberg equilibrium

Computational Statistics & Data Analysis
Evaluation of BIC and Cross Validation for model selection on sequence segmentations

International Journal of Data Mining and Bioinformatics
Discriminative structure selection method of Gaussian Mixture Models with its application to handwritten digit recognition

Neurocomputing
Searching for Coexpressed Genes in Three-Color cDNA Microarray Data Using a Probabilistic Model-Based Hough Transform

IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Unsupervised discretization using tree-based density estimation

PKDD'05 Proceedings of the 9th European conference on Principles and Practice of Knowledge Discovery in Databases
RSQRT: An heuristic for estimating the number of clusters to report

Electronic Commerce Research and Applications
Extracting robust distribution using adaptive Gaussian Mixture Model and online feature selection

Neurocomputing
Estimation of finite mixtures with symmetric components

Statistics and Computing
Estimating the predominant number of clusters in a dataset

Intelligent Data Analysis

Abstract

Cross-validated likelihood is investigated as a tool for automatically determining the appropriate number of components (given the data) in finite mixture modeling, particularly in the context of model-based probabilistic clustering. The conceptual framework for the cross-validation approach to model selection is straightforward in the sense that models are judged directly on their estimated out-of-sample predictive performance. The cross-validation approach, as well as penalized likelihood and McLachlan's bootstrap method, are applied to two data sets and the results from all three methods are in close agreement. The second data set involves a well-known clustering problem from the atmospheric science literature using historical records of upper atmosphere geopotential height in the Northern Hemisphere. Cross-validated likelihood provides an interpretable and objective solution to the atmospheric clustering problem. The clusters found are in agreement with prior analyses of the same data based on non-probabilistic clustering techniques.
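
The idea of judging each candidate number of components by its out-of-sample log-likelihood can be illustrated with a small sketch. The abstract does not specify the splitting scheme, the component family, or the fitting procedure, so everything below is an assumption made for illustration: one-dimensional Gaussian components, a plain EM fit with quantile-based initialisation, repeated random half/half splits, and the hypothetical function names `fit_gmm_1d`, `mean_log_likelihood`, and `cv_log_likelihood`. It is not the paper's implementation.

```python
import math
import random


def log_component(x, w, mu, var):
    """Log of (mixing weight times Gaussian density) at x."""
    return math.log(w) - 0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)


def fit_gmm_1d(xs, k, n_iter=50, min_var=1e-3):
    """Fit a k-component 1-D Gaussian mixture with plain EM (illustrative only)."""
    n = len(xs)
    xs_sorted = sorted(xs)
    # Assumed initialisation: means at evenly spaced quantiles, shared variance.
    means = [xs_sorted[int((j + 0.5) * n / k)] for j in range(k)]
    grand_mean = sum(xs) / n
    grand_var = max(sum((x - grand_mean) ** 2 for x in xs) / n, min_var)
    variances = [grand_var] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: posterior membership probabilities, computed stably in log space.
        resp = []
        for x in xs:
            logs = [log_component(x, weights[j], means[j], variances[j]) for j in range(k)]
            top = max(logs)
            ps = [math.exp(v - top) for v in logs]
            total = sum(ps)
            resp.append([p / total for p in ps])
        # M-step: re-estimate weights, means, and variances (variance floored).
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var_j, min_var)
    return weights, means, variances


def mean_log_likelihood(xs, model):
    """Average log-likelihood per data point under a fitted mixture."""
    weights, means, variances = model
    total = 0.0
    for x in xs:
        logs = [log_component(x, w, mu, var) for w, mu, var in zip(weights, means, variances)]
        top = max(logs)
        total += top + math.log(sum(math.exp(v - top) for v in logs))
    return total / len(xs)


def cv_log_likelihood(xs, ks, n_splits=5, test_fraction=0.5, seed=0):
    """Average held-out log-likelihood per point for each candidate k."""
    rng = random.Random(seed)
    n_test = int(len(xs) * test_fraction)
    scores = {k: 0.0 for k in ks}
    for _ in range(n_splits):
        shuffled = list(xs)
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        for k in ks:
            # Fit on the training part only, score on the held-out part.
            scores[k] += mean_log_likelihood(test, fit_gmm_1d(train, k)) / n_splits
    return scores


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic data: two well-separated clusters (an assumed toy example).
    data = [rng.gauss(-5.0, 1.0) for _ in range(100)] + [rng.gauss(5.0, 1.0) for _ in range(100)]
    scores = cv_log_likelihood(data, ks=[1, 2, 3, 4])
    for k in sorted(scores):
        print(k, round(scores[k], 3))
    print("selected k:", max(scores, key=scores.get))
```

In this sketch the selected number of components is simply the candidate with the highest average held-out log-likelihood. Because each model is scored on data it was not fitted to, adding components does not automatically improve the score, which is what makes the criterion usable for model selection without an explicit complexity penalty.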