Incorporating a mixed excitation model and postfilter into HMM-based text-to-speech synthesis

  • Authors:
  • Takayoshi Yoshimura;Keiichi Tokuda;Takashi Masuko;Takao Kobayashi;Tadashi Kitamura

  • Affiliations:
  • Research-Domain 21, Toyota Central R&D Labs., Inc., Aichi, 480-1192 Japan;Department of Computer Science, Nagoya Institute of Technology, Nagoya, 466-8555 Japan;Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, Yokohama, 226-8502 Japan;Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, Yokohama, 226-8502 Japan;Department of Computer Science, Nagoya Institute of Technology, Nagoya, 466-8555 Japan

  • Venue:
  • Systems and Computers in Japan
  • Year:
  • 2005

Abstract

In this paper we introduce a mixed excitation model into an HMM-based speech synthesis system with the objective of improving the quality of synthesized speech. In previous work we proposed a text-to-speech synthesis system that generates speech parameters from HMMs modeling mel-frequency cepstral coefficients, fundamental frequency, and duration. That system used a simple excitation model to drive the synthesis filter (an MLSA filter): a pulse sequence for voiced intervals and white noise for unvoiced intervals. Such an excitation model cannot synthesize sounds, such as voiced fricatives, that contain both a periodic and an aperiodic component, and this is one cause of poor synthesized speech quality. Therefore, in this paper we incorporate a mixed excitation model, based on the narrowband vocoding method MELP, that combines a pulse train with white noise, with a view to realizing high-quality speech synthesis. Since this excitation model can be applied to wideband as well as narrowband vocoding, we anticipate that it will prove effective for speech synthesis. In addition, we introduce a postfilter, a technique widely used in vocoding, to further improve the quality of the synthesized speech. The results of a subjective evaluation show the effectiveness of the mixed excitation model and postfilter in this system. © 2005 Wiley Periodicals, Inc. Syst Comp Jpn, 36(12): 43–50, 2005; Published online in Wiley InterScience. DOI 10.1002/scj.20354
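
The contrast between the two excitation models described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the single full-band `voicing_strength` weight, and the gain normalization are all assumptions made for illustration. MELP-style coders mix pulse and noise separately in several frequency bands, which this sketch omits for brevity.

```python
import math
import random


def pulse_train(n_samples, f0_hz, sample_rate):
    """Pulse train with one pulse every sample_rate / f0_hz samples.

    Each pulse has amplitude sqrt(period) so that the average power is
    roughly 1, comparable to unit-variance white noise (an assumption).
    """
    period = sample_rate / f0_hz
    out = [0.0] * n_samples
    next_pulse = 0.0
    while next_pulse < n_samples:
        out[int(next_pulse)] = math.sqrt(period)
        next_pulse += period
    return out


def white_noise(n_samples, rng):
    """Zero-mean, unit-variance Gaussian white noise."""
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def simple_excitation(n_samples, f0_hz, sample_rate, voiced, rng):
    """Hard switch: pulses for voiced intervals, noise for unvoiced ones."""
    if voiced:
        return pulse_train(n_samples, f0_hz, sample_rate)
    return white_noise(n_samples, rng)


def mixed_excitation(n_samples, f0_hz, sample_rate, voicing_strength, rng):
    """Weighted sum of a pulse train and white noise.

    voicing_strength is a hypothetical full-band weight in [0, 1]:
    1.0 gives pure pulses, 0.0 gives pure noise.
    """
    if not 0.0 <= voicing_strength <= 1.0:
        raise ValueError("voicing_strength must be in [0, 1]")
    pulse_gain = math.sqrt(voicing_strength)
    noise_gain = math.sqrt(1.0 - voicing_strength)
    pulses = pulse_train(n_samples, f0_hz, sample_rate)
    noise = white_noise(n_samples, rng)
    return [pulse_gain * p + noise_gain * n for p, n in zip(pulses, noise)]


if __name__ == "__main__":
    rng = random.Random(0)
    # A voiced fricative might be approximated with an intermediate weight.
    frame = mixed_excitation(320, 100.0, 16000, 0.6, rng)
    print(len(frame))
```

With the hard switch, a frame is either entirely periodic or entirely aperiodic. The mixed version lets an intermediate weight carry both components in the same frame, which is the property the abstract identifies as missing for sounds such as voiced fricatives.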