Fitting of mixtures with unspecified number of components using cross validation distance estimate

  • Authors:
  • Maja Miloslavsky;Mark J. van der Laan

  • Affiliations:
  • Division of Biostatistics, School of Public Health, University of California, Berkeley, CA;Division of Biostatistics, School of Public Health, University of California, Berkeley, CA

  • Venue:
  • Computational Statistics & Data Analysis
  • Year:
  • 2003

Abstract

Estimation of the number of mixture components (k) is an unsolved problem. Available methods for estimation of k include bootstrapping the likelihood ratio test statistic and optimizing a variety of validity functionals. We investigate minimizing the distance between the fitted mixture model and the true density as a method for estimating k. The distances considered are Kullback-Leibler (KL) and L2. We estimate these distances using cross validation. A reliable estimate of k is obtained by voting among B estimates of k, each corresponding to one of B cross validation estimates of distance. With KL distance, this estimation method is very similar to the Monte Carlo cross validated likelihood method discussed by Smyth (Statist. Computing 10(1) (2000) 63). Focusing on univariate normal mixtures, we present simulation studies that compare the cross validated distance method with Akaike's Information Criterion (AIC), the Bayesian Information Criterion/Minimum Description Length (BIC/MDL), and Information Complexity (ICOMP). We also apply the cross validation estimate of distance, along with the AIC, BIC/MDL and ICOMP approaches, to data from an osteoporosis drug trial in order to find groups that respond differentially to treatment. In our closing remarks, we highlight the general applicability of our method for choosing among any set of estimators of a particular parameter of interest, provided an approximately unbiased estimator is available.
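
As a rough illustration of the voting idea with KL distance, the sketch below may help. It is not the authors' implementation; it relies on the fact that KL(f || f_k) equals a term that does not depend on k minus the expected log density of the fitted mixture under the true density f, so minimizing a cross validated KL estimate amounts to maximizing the mean held-out log density. The function names (`fit_gmm_1d`, `mean_log_density`, `vote_k`), the plain EM fit, the 50/50 random split, and plurality voting are all assumptions made for this example; the abstract does not specify them.

```python
import numpy as np

def fit_gmm_1d(x, k, n_iter=100, seed=0):
    """Fit a k-component univariate normal mixture by plain EM (illustrative only)."""
    rng = np.random.default_rng(seed)
    w = np.full(k, 1.0 / k)                      # mixing weights
    mu = rng.choice(x, size=k, replace=False)    # means start at random data points
    var = np.full(k, np.var(x))                  # variances start at the overall variance
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = dens / np.maximum(dens.sum(axis=1, keepdims=True), 1e-300)
        # M-step: weighted updates; the small constants guard against degeneracy
        nk = resp.sum(axis=0) + 1e-10
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return w, mu, var

def mean_log_density(x, params):
    """Mean log density of x under a fitted mixture (negative KL up to a constant in k)."""
    w, mu, var = params
    dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return float(np.mean(np.log(np.maximum(dens.sum(axis=1), 1e-300))))

def vote_k(x, k_max=4, B=20, test_frac=0.5, seed=0):
    """Pick k by plurality vote over B random train/validation splits."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(k_max + 1, dtype=int)       # index 0 is unused
    n_test = int(len(x) * test_frac)
    for b in range(B):
        perm = rng.permutation(len(x))
        test, train = x[perm[:n_test]], x[perm[n_test:]]
        scores = [mean_log_density(test, fit_gmm_1d(train, k, seed=b))
                  for k in range(1, k_max + 1)]
        votes[1 + int(np.argmax(scores))] += 1   # this split's estimate of k
    return int(np.argmax(votes)), votes

# Example: two well-separated normal components (simulated data)
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-5, 1, 100), rng.normal(5, 1, 100)])
k_hat, votes = vote_k(x)
print(k_hat, votes[1:])
```

Each of the B splits yields its own estimate of k, and the final estimate is the value chosen most often. An L2 version would replace `mean_log_density` with a cross validated estimate of the integrated squared error.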