MML Clustering of Continuous-Valued Data Using Gaussian and t Distributions

Authors:
Yudi Agusta; David L. Dowe
Venue:
AI '02 Proceedings of the 15th Australian Joint Conference on Artificial Intelligence: Advances in Artificial Intelligence
Year:
2002

Citing 5
Cited 2

Sphere-packings, lattices, and groups
On the Length of Programs for Computing Finite Binary Sequences

Journal of the ACM (JACM)
Unsupervised Learning of Finite Mixture Models

IEEE Transactions on Pattern Analysis and Machine Intelligence
Finding overlapping components with MML

Statistics and Computing
MML clustering of multi-state, Poisson, vonMises circular and Gaussian distributions

Statistics and Computing

Unsupervised Selection of a Finite Dirichlet Mixture Model: An MML-Based Approach

IEEE Transactions on Knowledge and Data Engineering
High-Dimensional Unsupervised Selection and Estimation of a Finite Generalized Dirichlet Mixture Model Based on Minimum Message Length

IEEE Transactions on Pattern Analysis and Machine Intelligence

Abstract

Clustering, also known as mixture modelling or intrinsic classification, is the problem of identifying and modelling components (or clusters, or classes) in a body of data. We consider here the application of the Minimum Message Length (MML) principle to the clustering of data modelled by Gaussian and t distributions. Earlier work on MML clustering addressed the multinomial and Gaussian distributions (Wallace and Boulton, 1968) and, in addition, the von Mises circular and Poisson distributions (Wallace and Dowe, 1994, 2000). Our current work extends this by moving from the Gaussian distribution to the more general t distribution. Point estimation for the t distribution is performed using the MML approximation proposed by Wallace and Freeman (1987). We also compare the MML estimates for the t distribution with those of the Maximum Likelihood (ML) method in terms of their Kullback-Leibler (KL) distances. Within each component, our method also performs model selection on whether a particular group of data is best modelled as a Gaussian or a t distribution. The proposed modelling method is then applied to several artificially generated datasets, and the results are compared with those obtained using MML clustering of Gaussian distributions alone. Our modelling method compares quite well with an alternative clustering program (EMMIX), which uses various model selection criteria such as the Akaike Information Criterion (AIC) and Schwarz's Bayesian Information Criterion (BIC).
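
To make the KL-distance comparison mentioned in the abstract more concrete, the following is a minimal Python sketch, not the authors' procedure. It assumes the distance is measured from a known generating t distribution to each fitted one, and it evaluates the integral numerically with a plain trapezoidal rule. The function names (`t_logpdf`, `gaussian_logpdf`, `kl_distance`) and all parameter values are hypothetical and chosen only for illustration; the estimates are made-up numbers, not MML or ML results from the paper.

```python
import math

def t_logpdf(x, mu, sigma, nu):
    """Log density of a location-scale Student t distribution."""
    z = (x - mu) / sigma
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi) - math.log(sigma)
            - (nu + 1) / 2 * math.log1p(z * z / nu))

def gaussian_logpdf(x, mu, sigma):
    """Log density of a Gaussian distribution."""
    z = (x - mu) / sigma
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * z * z

def kl_distance(logp, logq, lo=-60.0, hi=60.0, steps=60_000):
    """Approximate KL(p || q) = integral of p(x) * log(p(x) / q(x)) dx
    with the trapezoidal rule on [lo, hi] (tails beyond are ignored)."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        lp = logp(x)
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * math.exp(lp) * (lp - logq(x))
    return total * h

# Hypothetical generating distribution and two hypothetical estimates of it.
true_model = lambda x: t_logpdf(x, mu=0.0, sigma=1.0, nu=5.0)
estimate_a = lambda x: t_logpdf(x, mu=0.1, sigma=1.05, nu=6.0)
estimate_b = lambda x: t_logpdf(x, mu=0.3, sigma=1.30, nu=12.0)

# The estimate with the smaller KL distance is closer to the generating model.
print("KL(true || estimate A):", kl_distance(true_model, estimate_a))
print("KL(true || estimate B):", kl_distance(true_model, estimate_b))
```

Under these assumptions, the estimate with the smaller KL distance would be judged the better one. The integration range and step count are arbitrary choices that suit moderate degrees of freedom; very heavy-tailed cases would need a wider range or a different integration scheme.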