Speech synthesis using HMMs with dynamic features

  • Authors:
  • T. Masuko; K. Tokuda; T. Kobayashi; S. Imai

  • Affiliations:
  • Precision & Intelligence Lab., Tokyo Inst. of Technol., Yokohama, Japan; Dept. of Electr. Eng. & Electron., Liverpool Univ., UK; Res. Lab., IBM Japan Ltd., Tokyo, Japan; Dragon Syst. Inc., Newton, MA, USA

  • Venue:
  • ICASSP '96: Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing - Volume 01
  • Year:
  • 1996

Abstract

This paper presents a new text-to-speech synthesis system based on HMMs that include dynamic features, i.e., the delta and delta-delta parameters of speech. The system uses triphone HMMs as the synthesis units. The triphone HMMs share fewer than 2,000 clustered states, each of which is modelled by a single Gaussian distribution. For a given text to be synthesized, a sentence HMM is constructed by concatenating the triphone HMMs. Speech parameters are generated from the sentence HMM in such a way that the output probability is maximized. The speech signal is then synthesized directly from the obtained parameters using the mel log spectral approximation (MLSA) filter. Without dynamic features, discontinuities in the generated speech spectra cause glitches in the synthesized speech. With dynamic features, however, the synthesized speech becomes quite smooth and natural even when the number of clustered states is small.
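The core of the generation step described above can be sketched for a single feature dimension with static and delta parameters. This is a minimal NumPy illustration, not the authors' implementation: it assumes the state sequence is already fixed, so each frame contributes a diagonal Gaussian over the static and delta features, and the static trajectory maximizing the output probability is the solution of a weighted least-squares system. The function name `mlpg_1d` and the simple delta window `Δc_t = (c_{t+1} - c_{t-1})/2` are illustrative choices.

```python
import numpy as np

def mlpg_1d(mu, var):
    """Maximum-likelihood parameter generation sketch, one dimension.

    mu, var : arrays of shape (T, 2) holding the per-frame Gaussian
    means and variances of the static (column 0) and delta (column 1)
    features, read off the sentence HMM's state sequence.
    Returns the static trajectory c (length T) that maximizes the
    output probability of the stacked [static; delta] observations.
    """
    T = mu.shape[0]
    # W maps the static trajectory c to the stacked vector [static; delta].
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[t, t] = 1.0                    # static row: o_t = c_t
        lo, hi = max(t - 1, 0), min(t + 1, T - 1)
        W[T + t, hi] += 0.5              # delta row: (c_{t+1} - c_{t-1}) / 2
        W[T + t, lo] -= 0.5
    u = np.concatenate([mu[:, 0], mu[:, 1]])                   # stacked means
    P = np.diag(1.0 / np.concatenate([var[:, 0], var[:, 1]]))  # inverse covariances
    # Maximizing the Gaussian log-likelihood in c gives the normal
    # equations (W' P W) c = W' P u.
    return np.linalg.solve(W.T @ P @ W, W.T @ P @ u)
```

Because the delta rows couple neighbouring frames, the solution interpolates smoothly between the per-state static means instead of jumping at state boundaries, which is exactly the smoothing effect the abstract attributes to dynamic features.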