In the EMIME project, we developed a mobile device that performs personalized speech-to-speech translation: a user's spoken input in one language is used to produce spoken output in another language that still sounds like the user's own voice. We integrated two techniques into a single architecture: unsupervised adaptation for HMM-based TTS using word-based large-vocabulary continuous speech recognition, and cross-lingual speaker adaptation (CLSA) for HMM-based TTS. The CLSA is based on a state-level transform mapping learned by minimizing the Kullback-Leibler divergence (KLD) between pairs of HMM states in the input and output languages. Together, these techniques yield an unsupervised cross-lingual speaker adaptation system. End-to-end speech-to-speech translation systems for four languages (English, Finnish, Mandarin, and Japanese) were constructed within this framework. In this paper, the English-to-Japanese adaptation is evaluated. Listening tests demonstrate that adapted voices sound more similar to a target speaker than average voices, and that the differences between supervised and unsupervised cross-lingual speaker adaptation are small. Calculating the KLD state mapping on only the first 10 mel-cepstral coefficients yields substantial computational savings without any detrimental effect on the quality of the synthetic speech.
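The core of the CLSA approach described above is pairing each HMM state in the input language with the output-language state that minimizes the KLD between their emission distributions. The following is a minimal sketch of that mapping step, assuming diagonal-covariance Gaussian state distributions; the function names and the `n_coeffs` truncation parameter (modeling the restriction to the first 10 mel-cepstral coefficients) are illustrative, not the project's actual API.

```python
import numpy as np

def gaussian_kld(mu1, var1, mu2, var2):
    """KL(N1 || N2) for two diagonal-covariance Gaussians.

    mu*/var* are 1-D arrays of per-dimension means and variances.
    """
    return 0.5 * np.sum(
        np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0
    )

def min_kld_state_mapping(in_states, out_states, n_coeffs=None):
    """Map each input-language state to the minimum-KLD output-language state.

    Each state is a (mean, variance) pair of 1-D arrays. If n_coeffs is
    given, only the first n_coeffs coefficients are used (e.g. the first
    10 mel-cepstral coefficients), which shrinks the per-pair cost.
    Returns a list: mapping[i] = index of the best output state for
    input state i.
    """
    mapping = []
    for mu_i, var_i in in_states:
        if n_coeffs is not None:
            mu_i, var_i = mu_i[:n_coeffs], var_i[:n_coeffs]
        best_j, best_kld = -1, np.inf
        for j, (mu_o, var_o) in enumerate(out_states):
            if n_coeffs is not None:
                mu_o, var_o = mu_o[:n_coeffs], var_o[:n_coeffs]
            kld = gaussian_kld(mu_i, var_i, mu_o, var_o)
            if kld < best_kld:
                best_j, best_kld = j, kld
        mapping.append(best_j)
    return mapping
```

Once such a mapping is in place, adaptation transforms estimated on the input-language states can be applied to their mapped output-language counterparts, which is what allows the output voice to retain the input speaker's characteristics.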