A maximum likelihood approximation method for Dirichlet's parameter estimation

Authors:
Nicolas Wicker;Jean Muller;Ravi Kiran Reddy Kalathur;Olivier Poch
Affiliations:
Laboratoire de Bioinformatique et de Génomique Intégratives, Institut de Génétique et de Biologie Moléculaire et Cellulaire, CNRS/INSERM/ULP, BP 10142, 67404 Illkirch Cede ... (all four authors)
Venue:
Computational Statistics & Data Analysis
Year:
2008

Abstract

Dirichlet distributions are natural choices to analyse data described by frequencies or proportions, since they are the simplest known distributions for such data apart from the uniform distribution. They are often used whenever proportions are involved, for example in text-mining, image analysis and biology, or as a prior of a multinomial distribution in Bayesian statistics. As the Dirichlet distribution belongs to the exponential family, its parameters can be easily inferred by maximum likelihood. Parameter estimation is usually performed with the Newton-Raphson algorithm after an initialisation step using either the moments method or Ronning's method. However, this initialisation can result in parameters that lie outside the admissible region. A simple and very efficient alternative based on a maximum likelihood approximation is presented. The advantages of the presented method compared to the two other methods are demonstrated on synthetic data sets as well as on a practical biological problem: the clustering of protein sequences based on their amino acid compositions.
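To make the conventional pipeline mentioned in the abstract concrete, the following is a minimal Python sketch of Newton-Raphson maximum likelihood estimation started from a moments-based initialisation. It is not the approximation method proposed in the paper, which the abstract does not specify. All function names (`digamma`, `trigamma`, `moment_init`, `fit_dirichlet`, `sample_dirichlet`) are hypothetical, the digamma and trigamma routines are simple series approximations written for this sketch, and the step-halving safeguard that keeps the parameters strictly positive is an assumption added here rather than something taken from the paper.

```python
import math
import random


def digamma(x):
    """Approximate digamma for x > 0 (recurrence, then asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))


def trigamma(x):
    """Approximate trigamma for x > 0 (recurrence, then asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + 1.0 / x + 0.5 * inv2
            + (inv2 / x) * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 / 42)))


def moment_init(data):
    """Moments-based start: match the means, and the variance of component 0."""
    n, k = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(k)]
    var0 = sum(row[0] ** 2 for row in data) / n - mean[0] ** 2
    precision = mean[0] * (1.0 - mean[0]) / var0 - 1.0  # estimate of sum(alpha)
    return [m * precision for m in mean]


def fit_dirichlet(data, max_iter=100, tol=1e-10):
    """Newton-Raphson on the Dirichlet log-likelihood (rows must be > 0)."""
    n, k = len(data), len(data[0])
    mean_log = [sum(math.log(row[j]) for row in data) / n for j in range(k)]
    alpha = moment_init(data)
    for _ in range(max_iter):
        total = sum(alpha)
        grad = [n * (digamma(total) - digamma(a) + ml)
                for a, ml in zip(alpha, mean_log)]
        # The Hessian is diagonal plus a rank-one term, so the Newton
        # direction can be computed in O(k) via the Sherman-Morrison identity.
        q = [-n * trigamma(a) for a in alpha]
        z = n * trigamma(total)
        b = (sum(g / qj for g, qj in zip(grad, q))
             / (1.0 / z + sum(1.0 / qj for qj in q)))
        step = [(g - b) / qj for g, qj in zip(grad, q)]
        # Assumed safeguard: halve the step until every alpha stays positive.
        t = 1.0
        while any(a - t * s <= 0.0 for a, s in zip(alpha, step)):
            t *= 0.5
        alpha = [a - t * s for a, s in zip(alpha, step)]
        if max(abs(t * s) for s in step) < tol:
            break
    return alpha


def sample_dirichlet(alpha, n, rng):
    """Draw n Dirichlet vectors by normalising independent gamma variates."""
    rows = []
    for _ in range(n):
        g = [rng.gammavariate(a, 1.0) for a in alpha]
        s = sum(g)
        rows.append([x / s for x in g])
    return rows
```

A possible usage is `fit_dirichlet(sample_dirichlet([2.0, 5.0, 3.0], 2000, random.Random(0)))`, which should return estimates near the generating parameters. The step-halving loop is where the admissibility issue becomes visible: without it, a full Newton step taken from a poor starting point can push a component to zero or below, where the log-likelihood is undefined.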