Dynamic Assignment of Gaussian Components in Modelling Speech Spectra

  • Authors: Parham Zolfaghari, Hiroko Kato, Yasuhiro Minami, Atsushi Nakamura, Shigeru Katagiri, Roy Patterson

  • Affiliations: Speech Open Lab, NTT Communication Science Labs, NTT Corporation, Kyoto, Japan (Zolfaghari, Kato, Minami, Nakamura, Katagiri); Centre for the Neural Basis of Hearing, Department of Physiology, University of Cambridge, Cambridge, UK (Patterson)

  • Venue: Journal of VLSI Signal Processing Systems
  • Year: 2006

Abstract

In this paper, we describe a parametric mixture model of the resonant characteristics of the vocal tract, in which Gaussian distributions model spectral frequency regions. A mixture of Gaussians (MoG) parametrisation scheme is used to model a smoothed representation of the spectra. The smoothing procedure removes all signal periodicity from the spectra, allowing highly natural analysis, manipulation and synthesis of speech. The goal of this parametrisation scheme is to ease the correspondence between the resonant characteristics of the vocal tract and the parametric distributions, and to model the spectrum with an appropriate number of parameters. Previously, a maximum likelihood (ML) approach to this parametrisation scheme was introduced. However, this approach suffers from inherent local optima problems. Noting that a relatively small class of Gaussian densities can approximate a large class of distributions, we propose a new scheme that starts with a large number of distributions in the mixture, systematically reduces their number, and re-approximates the densities in the mixture based on a distance criterion. The Kullback-Leibler (KL) distance was found to allow optimal MoG solutions to the spectra. Furthermore, a fitness measure based on KL information is used to provide a figure for estimating the model order in representing formant-like features. The proposed model is subjectively evaluated and is shown to reduce the number of Gaussians with an appreciable loss in the quality of the re-synthesised speech.
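
The reduction idea described in the abstract (start with many Gaussians, then repeatedly reduce their number using a KL-based distance) can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not give the exact distance or re-approximation rule, so the sketch assumes one-dimensional components along the frequency axis, a symmetrised KL distance between pairs of components, and a moment-matching merge of the closest pair. The function names (kl_gauss, sym_kl, merge, reduce_mixture) and the example values in Hz are hypothetical.

```python
import math

def kl_gauss(m1, v1, m2, v2):
    """KL(N(m1, v1) || N(m2, v2)) for 1-D Gaussians given means and variances."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def sym_kl(a, b):
    """Symmetrised KL distance between two (weight, mean, variance) components.
    Assumption: the paper's actual distance criterion may differ."""
    _, m1, v1 = a
    _, m2, v2 = b
    return kl_gauss(m1, v1, m2, v2) + kl_gauss(m2, v2, m1, v1)

def merge(a, b):
    """Replace two components by one Gaussian with the same total weight,
    mean and variance as the pair (moment matching, an assumed rule)."""
    w1, m1, v1 = a
    w2, m2, v2 = b
    w = w1 + w2
    m = (w1 * m1 + w2 * m2) / w
    v = (w1 * (v1 + (m1 - m) ** 2) + w2 * (v2 + (m2 - m) ** 2)) / w
    return (w, m, v)

def reduce_mixture(components, target):
    """Greedily merge the closest pair until `target` components remain."""
    comps = list(components)
    while len(comps) > max(target, 1):
        pairs = ((i, j) for i in range(len(comps)) for j in range(i + 1, len(comps)))
        i, j = min(pairs, key=lambda p: sym_kl(comps[p[0]], comps[p[1]]))
        merged = merge(comps[i], comps[j])
        comps = [c for k, c in enumerate(comps) if k not in (i, j)] + [merged]
    return sorted(comps, key=lambda c: c[1])  # order by mean (frequency)

# Hypothetical mixture: (weight, mean in Hz, variance in Hz^2)
mixture = [(0.25, 500.0, 100.0 ** 2), (0.25, 550.0, 100.0 ** 2),
           (0.30, 1500.0, 150.0 ** 2), (0.20, 2500.0, 200.0 ** 2)]
reduced = reduce_mixture(mixture, 3)
```

In this example the two overlapping low-frequency components are the closest pair and are merged first, while the total mixture weight is preserved. A greedy pairwise merge is only one possible reading of "systematically reduce their number"; the paper's own scheme and its KL-based fitness measure for model order are not reproduced here.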