Thousands of voices for HMM-based speech synthesis: analysis and application of TTS systems built on various ASR corpora

Authors:
Junichi Yamagishi;Bela Usabaev;Simon King;Oliver Watts;John Dines;Jilei Tian;Yong Guan;Rile Hu;Keiichiro Oura;Yi-Jian Wu;Keiichi Tokuda;Reima Karhila;Mikko Kurimo
Affiliations:
Centre for Speech Technology Research, University of Edinburgh, Edinburgh, UK;Universität Tübingen, Tübingen, Germany;Centre for Speech Technology Research, University of Edinburgh, Edinburgh, UK;Centre for Speech Technology Research, University of Edinburgh, Edinburgh, UK;Idiap Research Institute, Martigny, Switzerland;Nokia Research Center, Beijing, China;Nokia Research Center, Beijing, China;Nokia Research Center, Beijing, China;Department of Computer Science and Engineering, Nagoya Institute of Technology, Nagoya, Japan;TTS Group, Microsoft Business Division, Beijing, China and Nagoya Institute of Technology, Nagoya, Japan;Department of Computer Science and Engineering, Nagoya Institute of Technology, Nagoya, Japan;Adaptive Informatics Research Centre, Helsinki University of Technology, TKK, Finland;Adaptive Informatics Research Centre, Helsinki University of Technology, TKK, Finland
Venue:
IEEE Transactions on Audio, Speech, and Language Processing
Year:
2010

Abstract

In conventional speech synthesis, building a voice typically requires large amounts of phonetically balanced speech data recorded in highly controlled studio environments. Although using such data is a straightforward route to high-quality synthesis, the number of voices available will always be limited, because recording costs are high. On the other hand, our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis (which uses an "average voice model" plus model adaptation) is robust to non-ideal speech data: data recorded under various conditions and with varying microphones, data that are not perfectly clean, and/or data that lack phonetic balance. This enables us to consider building high-quality voices on "non-TTS" corpora such as ASR corpora. Since ASR corpora generally include a large number of speakers, this opens up the possibility of producing an enormous number of voices automatically. In this paper, we present the thousands of voices for HMM-based speech synthesis that we have built from several popular ASR corpora, including the Wall Street Journal (WSJ0, WSJ1, and WSJCAM0), Resource Management, GlobalPhone, and SPEECON databases.