Fitting of mixtures with unspecified number of components using cross validation distance estimate

Authors:
Maja Miloslavsky;Mark J. van der Laan
Affiliations:
Division of Biostatistics, School of Public Health, University of California, Berkeley, CA;Division of Biostatistics, School of Public Health, University of California, Berkeley, CA
Venue:
Computational Statistics & Data Analysis
Year:
2003

Citing 3
Cited 9

An improvement of the NEC criterion for assessing the number of clusters in a mixture model
Non-Linear Analysis

Akaike's information criterion and recent developments in information complexity
Journal of Mathematical Psychology

Model selection for probabilistic clustering using cross-validated likelihood
Statistics and Computing


Abstract

Estimating the number of mixture components (k) is an unsolved problem. Available methods for estimating k include bootstrapping the likelihood ratio test statistic and optimizing a variety of validity functionals. We investigate minimizing the distance between the fitted mixture model and the true density as a method for estimating k. The distances considered are Kullback-Leibler (KL) and L2. We estimate these distances using cross validation. A reliable estimate of k is obtained by voting among B estimates of k, corresponding to B cross validation estimates of distance. With the KL distance, this estimation method is very similar to the Monte Carlo cross validated likelihood method discussed by Smyth (Statist. Computing 10(1) (2000) 63). Focusing on univariate normal mixtures, we present simulation studies that compare the cross validated distance method with Akaike's Information Criterion (AIC), the Bayesian Information Criterion/Minimum Description Length (BIC/MDL), and Information Complexity (ICOMP). We also apply the cross validated distance approach, along with the AIC, BIC/MDL and ICOMP approaches, to data from an osteoporosis drug trial in order to find groups that respond differentially to treatment. In our closing remarks, we highlight the general applicability of our method for choosing among any set of estimators of a particular parameter of interest, assuming that an approximately unbiased estimator is available.
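
The voting procedure described in the abstract can be illustrated with a minimal sketch for univariate normal mixtures and the KL distance. It is not the authors' implementation: the plain EM fit, the quantile initialisation, the 50/50 random split, the variance floor, and all function names (fit_mixture, mean_log_density, vote_for_k) are assumptions made for illustration. The sketch relies on the fact that the KL distance between the true density and a fitted density equals a term that does not depend on k minus the expected log of the fitted density, so minimizing the estimated KL distance amounts to maximizing the mean held-out log density.

```python
import math
import random
from collections import Counter

def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def fit_mixture(data, k, n_iter=30, min_var=1e-3):
    """Fit a k-component univariate normal mixture by plain EM (illustrative only)."""
    xs = sorted(data)
    n = len(xs)
    # Assumed initialisation: means at evenly spaced quantiles, shared variance.
    mus = [xs[int((j + 0.5) * n / k)] for j in range(k)]
    mean = sum(xs) / n
    var0 = max(sum((x - mean) ** 2 for x in xs) / n, min_var)
    vars_ = [var0] * k
    ws = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: posterior component probabilities for each observation.
        resp = []
        for x in xs:
            p = [w * normal_pdf(x, m, v) for w, m, v in zip(ws, mus, vars_)]
            s = sum(p) or 1e-300
            resp.append([pj / s for pj in p])
        # M-step: update weights, means, and variances (variance floored).
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            ws[j] = nj / n
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            vars_[j] = max(
                sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj,
                min_var,
            )
    return ws, mus, vars_

def mean_log_density(data, params):
    """Average log of the fitted mixture density over the given points."""
    ws, mus, vars_ = params
    total = 0.0
    for x in data:
        d = sum(w * normal_pdf(x, m, v) for w, m, v in zip(ws, mus, vars_))
        total += math.log(max(d, 1e-300))
    return total / len(data)

def vote_for_k(data, k_max=4, B=11, test_frac=0.5, seed=0):
    """Return the k chosen most often across B random train/test splits."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(B):
        shuffled = list(data)
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_frac)
        test, train = shuffled[:n_test], shuffled[n_test:]
        # Highest held-out mean log density = smallest estimated KL distance,
        # up to a constant that does not depend on k.
        scores = {k: mean_log_density(test, fit_mixture(train, k))
                  for k in range(1, k_max + 1)}
        votes[max(scores, key=scores.get)] += 1
    return votes.most_common(1)[0][0], votes

# Simulated data: two well-separated normal components (hypothetical example).
rng = random.Random(1)
data = [rng.gauss(0, 1) for _ in range(100)] + [rng.gauss(6, 1) for _ in range(100)]
best_k, votes = vote_for_k(data)
print("selected k:", best_k, "votes:", dict(votes))
```

Each of the B splits casts one vote for the k with the smallest estimated distance, and the most frequent vote is reported. An L2 version would replace mean_log_density with a cross validated estimate of the L2 distance; that variant is not shown here.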