A cross-language state sharing and mapping approach to bilingual (Mandarin-English) TTS

  • Authors: Yao Qian; Hui Liang; Frank K. Soong
  • Affiliations: Microsoft Research Asia, Beijing, China; Idiap Research Institute, Martigny, Switzerland and Microsoft Research Asia, Beijing, China; Microsoft Research Asia, Beijing, China
  • Venue: IEEE Transactions on Audio, Speech, and Language Processing
  • Year: 2009


Abstract

We propose a hidden Markov model (HMM)-based bilingual (Mandarin and English) text-to-speech (TTS) system to synthesize natural speech from given bilingual text. A simple baseline system consisting of two independent monolingual HMM synthesizers is first built from corresponding Mandarin and English data recorded by a bilingual speaker. A new, mixed-language TTS system is then constructed by sharing HMM states across the two languages via decision-tree-based clustering that asks both language-independent and language-specific questions. Through state sharing, the new system has a smaller footprint than the baseline. Speech synthesized by the new system sounds very similar to the baseline for non-mixed (monolingual Mandarin or English) sentences but much better for mixed-language sentences; the higher quality of the mixed-language output is confirmed by a preference score of 60.2% versus 39.8% in a subjective listening test. A cross-language state mapping algorithm is further proposed for cross-language synthesis when only monolingual (English) data recorded by a source-language speaker is available. Mandarin speech is then synthesized with the HMM model parameters in the nearest-neighbor leaf nodes of the English decision tree. Nearest neighbors are measured by the Kullback-Leibler divergence (KLD), and mappings between leaf nodes in the decision trees of the source and target languages are established via speech data recorded by a different, bilingual speaker. The synthesized target-language sentences preserve high voice (speaker) similarity to the source-language speaker, even though only monolingual source-language recordings of that speaker are used. Perceptual tests on the synthesized Mandarin speech show 1) high intelligibility, confirmed by a Chinese character transcription accuracy of 92.1%, and 2) decent speech quality, with an average mean opinion score (MOS) of 3.1.
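The cross-language mapping step described above can be sketched in a few lines: each leaf node of the source-language decision tree is paired with the target-language leaf whose state distribution is nearest under the KLD. The sketch below is illustrative only, assuming diagonal-covariance Gaussian state output distributions (for which the KLD has a closed form) and a symmetrized divergence; the leaf names, data structures, and function names are all hypothetical, not the paper's actual implementation.

```python
import numpy as np

def kl_gauss_diag(mu_p, var_p, mu_q, var_q):
    """Closed-form KL divergence KL(p || q) between two
    diagonal-covariance Gaussians (all arguments are 1-D arrays)."""
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )

def map_states(source_leaves, target_leaves):
    """Map each source-language leaf to its nearest target-language leaf.

    Each leaf is represented as name -> (mean, variance). A symmetrized
    KLD is used here as the distance; this is an assumption, not
    necessarily the exact measure used in the paper.
    """
    mapping = {}
    for s_name, (s_mu, s_var) in source_leaves.items():
        best_name, best_dist = None, np.inf
        for t_name, (t_mu, t_var) in target_leaves.items():
            d = (kl_gauss_diag(s_mu, s_var, t_mu, t_var)
                 + kl_gauss_diag(t_mu, t_var, s_mu, s_var))
            if d < best_dist:
                best_name, best_dist = t_name, d
        mapping[s_name] = best_name
    return mapping

# Toy example with made-up leaf statistics: the English leaf maps to the
# Mandarin leaf whose Gaussian is closest in (symmetrized) KLD.
english_leaves = {"en_leaf_1": (np.array([0.0, 0.0]), np.array([1.0, 1.0]))}
mandarin_leaves = {
    "zh_leaf_a": (np.array([0.1, 0.0]), np.array([1.0, 1.0])),
    "zh_leaf_b": (np.array([5.0, 5.0]), np.array([1.0, 1.0])),
}
print(map_states(english_leaves, mandarin_leaves))  # {'en_leaf_1': 'zh_leaf_a'}
```

In the paper's setting, these mappings are learned once from a bilingual speaker's trees and then reused to synthesize Mandarin from a monolingual English speaker's models.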