Thousands of voices for HMM-based speech synthesis: analysis and application of TTS systems built on various ASR corpora

  • Authors:
  • Junichi Yamagishi; Bela Usabaev; Simon King; Oliver Watts; John Dines; Jilei Tian; Yong Guan; Rile Hu; Keiichiro Oura; Yi-Jian Wu; Keiichi Tokuda; Reima Karhila; Mikko Kurimo

  • Affiliations:
  • Centre for Speech Technology Research, University of Edinburgh, Edinburgh, UK
  • Universität Tübingen, Tübingen, Germany
  • Idiap Research Institute, Martigny, Switzerland
  • Nokia Research Center, Beijing, China
  • Department of Computer Science and Engineering, Nagoya Institute of Technology, Nagoya, Japan
  • TTS Group, Microsoft Business Division, Beijing, China and Nagoya Institute of Technology, Nagoya, Japan
  • Adaptive Informatics Research Centre, Helsinki University of Technology, TKK, Finland

  • Venue:
  • IEEE Transactions on Audio, Speech, and Language Processing
  • Year:
  • 2010

Abstract

In conventional speech synthesis, large amounts of phonetically balanced speech data recorded in highly controlled recording studio environments are typically required to build a voice. Although using such data is a straightforward solution for high-quality synthesis, the number of voices available will always be limited, because recording costs are high. On the other hand, our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis (which uses an "average voice model" plus model adaptation) is robust to non-ideal speech data that are recorded under various conditions and with varying microphones, that are not perfectly clean, and/or that lack phonetic balance. This enables us to consider building high-quality voices on "non-TTS" corpora such as ASR corpora. Since ASR corpora generally include a large number of speakers, this leads to the possibility of producing an enormous number of voices automatically. In this paper, we demonstrate the thousands of voices for HMM-based speech synthesis that we have made from several popular ASR corpora such as the Wall Street Journal corpora (WSJ0, WSJ1, and WSJCAM0), Resource Management, GlobalPhone, and SPEECON databases.