Multi-speaker articulatory trajectory formation based on speaker-independent articulatory HMMs

  • Authors:
  • Sadao Hiroya; Takemi Mochida

  • Affiliations:
  • NTT Communication Science Laboratories, NTT Corporation, 3-1 Morinosato-Wakamiya, Atsugi-shi, Kanagawa 243-0198, Japan (both authors)

  • Venue:
  • Speech Communication
  • Year:
  • 2006

Abstract

Inter-speaker variability in the speech spectrum domain has been modeled using speaker-adaptive training (SAT), in which speaker-independent phoneme-specific hidden Markov models (HMMs) are combined with a speaker-adaptive matrix. In this paper, multi-speaker articulatory trajectory formation based on this method is presented. Both speaker-independent and speaker-specific features are statistically separated from a multi-speaker articulatory database, which consists of mid-sagittal motion data of the lips, incisor, and tongue measured with an electromagnetic articulography (EMA) system. We evaluated the proposed method in terms of the RMS error between measured and estimated articulatory parameters. When multi-speaker models of articulatory parameters with two speaker-adaptive matrices per speaker were used, the average RMS error was 1.29 mm, showing no statistically significant difference from that of speaker-dependent models (1.22 mm). For comparison, multi-speaker models of the conventional speech spectrum were also constructed from a multi-speaker spectrum database consisting of speech data recorded simultaneously with the articulatory measurements. The average spectral distance between the measured vocal-tract spectrum and the spectrum estimated from the two-matrix models was 4.19 dB, a statistically significant difference from that of speaker-dependent models (3.97 dB). These results indicate that modeling inter-speaker variability in the articulatory parameter domain with a small number of matrices per speaker approximates speaker-dependent articulation almost perfectly, and does so better than the corresponding modeling in the speech spectrum domain.
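As context for the abstract, the sketch below illustrates in NumPy the general form of a SAT-style affine speaker-adaptive transform applied to a speaker-independent HMM state mean, together with the RMS-error metric used for evaluation. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, dimensionality, and toy data are illustrative, and the paper shares two such matrices per speaker across the articulatory HMMs rather than the single transform shown here.

```python
import numpy as np

# Assumed dimensionality: e.g., x/y coordinates of 7 mid-sagittal EMA
# coils (lips, incisor, tongue) -> 14-dimensional articulatory vector.
D = 14

def adapt_mean(mu_si: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Map a speaker-independent mean to a speaker-specific mean via an
    affine speaker-adaptive transform W = [b | A] (SAT-style)."""
    xi = np.concatenate(([1.0], mu_si))   # extended mean vector [1, mu]
    return W @ xi                         # speaker-specific mean

def rms_error_mm(measured: np.ndarray, estimated: np.ndarray) -> float:
    """RMS error over all frames and articulatory dimensions (mm)."""
    return float(np.sqrt(np.mean((measured - estimated) ** 2)))

# Toy usage with random data (for illustration only).
rng = np.random.default_rng(0)
mu_si = rng.normal(size=D)                     # SI mean of one HMM state
W = np.hstack([rng.normal(size=(D, 1)),        # bias column b
               np.eye(D)])                     # A initialized near identity
mu_spk = adapt_mean(mu_si, W)

measured = rng.normal(size=(100, D))           # 100 frames of EMA data
estimated = measured + rng.normal(scale=0.1, size=(100, D))
print(f"RMS error: {rms_error_mm(measured, estimated):.2f} mm")
```

The design point the abstract turns on is that a small number of per-speaker matrices of this kind, applied to shared speaker-independent articulatory HMMs, can recover nearly the accuracy of training a full model per speaker.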