The IBM expressive text-to-speech synthesis system for American English

Authors:
J. F. Pitrelli;R. Bakis;E. M. Eide;R. Fernandez;W. Hamza;M. A. Picheny
Affiliations:
IBM T. J. Watson Research Center, Yorktown Heights, NY
Venue:
IEEE Transactions on Audio, Speech, and Language Processing
Year:
2006

Cited by (11):

Evolutionary-Based Design of a Brazilian Portuguese Recording Script for a Concatenative Synthesis System
PROPOR '08 Proceedings of the 8th international conference on Computational Processing of the Portuguese Language

A Style Control Technique for HMM-Based Expressive Speech Synthesis
IEICE - Transactions on Information and Systems

Automatic refinement of an expressive speech corpus assembling subjective perception and automatic classification
Speech Communication

Accessibility of board and presentations in the classroom: a design-for-all approach
Telehealth/AT '08 Proceedings of the IASTED International Conference on Telehealth/Assistive Technologies

Analysis of statistical parametric and unit selection speech synthesis systems applied to emotional speech
Speech Communication

On the use of nonverbal speech sounds in human communication
COST 2102'07 Proceedings of the 2007 COST action 2102 international conference on Verbal and nonverbal communication behaviours

Are synthesized video descriptions acceptable?
Proceedings of the 12th international ACM SIGACCESS conference on Computers and accessibility

An intuitive style control technique in HMM-based expressive speech synthesis using subjective style intensity and multiple-regression global variance model
Speech Communication

Expressive speech synthesis: a review
International Journal of Speech Technology

In the game: the interface between Watson and Jeopardy!
IBM Journal of Research and Development

Prosodic variation enhancement using unsupervised context labeling for HMM-based expressive speech synthesis
Speech Communication

Abstract

Expressive text-to-speech (TTS) synthesis should contribute to the pleasantness, intelligibility, and speed of speech-based human-machine interactions that use TTS. We describe a TTS engine that can be directed, via text markup, to use a variety of expressive styles: here, questioning, contrastive emphasis, and conveying good and bad news. Differences among these styles lead us to investigate two approaches to expressive TTS, a "corpus-driven" approach and a "prosodic-phonology" approach. Each speaker records 11 h (excluding silences) of "neutral" sentences. In the corpus-driven approach, the speaker also records a 1-h corpus in each expressive style; these segments are tagged by style for use during search, and decision trees for determining f0 contours and timing are trained separately for each of the neutral and expressive corpora. In the prosodic-phonology approach, rules translating certain expressive markup elements to tones and break indices (ToBI) are manually determined, and the ToBI elements are used in single f0 and duration trees for all expressions. Tests show that listeners correctly identify the intended style of the synthesis at rates ranging from 70% for "conveying bad news" to 85% for "yes-no questions". Further improvements are demonstrated through the use of speaker-pooled f0 and duration models.
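
The corpus-driven approach in the abstract tags recorded segments by style "for use during search". A minimal sketch of how such a tag might enter a unit-selection cost is given below. It is an illustration under stated assumptions, not the system's actual implementation: the `Unit` fields, the penalty value, the pitch scaling, and the style names are all hypothetical, and a real search would also weigh concatenation costs across a whole utterance rather than picking one unit at a time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """A recorded segment; all fields are hypothetical for illustration."""
    phone: str    # phone label of the segment
    style: str    # style tag of the corpus the segment came from
    f0_hz: float  # mean pitch of the segment


# Assumed penalty for using a segment whose style tag differs from the request.
STYLE_MISMATCH_PENALTY = 5.0


def target_cost(unit: Unit, target_f0_hz: float, requested_style: str) -> float:
    """Pitch distance plus a fixed penalty when the style tag does not match."""
    pitch_cost = abs(unit.f0_hz - target_f0_hz) / 10.0
    style_cost = 0.0 if unit.style == requested_style else STYLE_MISMATCH_PENALTY
    return pitch_cost + style_cost


def pick_unit(candidates, phone: str, target_f0_hz: float, requested_style: str) -> Unit:
    """Return the lowest-cost candidate for one phone (no concatenation cost)."""
    matching = [u for u in candidates if u.phone == phone]
    if not matching:
        raise ValueError(f"no candidate units for phone {phone!r}")
    return min(matching, key=lambda u: target_cost(u, target_f0_hz, requested_style))


inventory = [
    Unit("AA", "neutral", 120.0),
    Unit("AA", "good_news", 150.0),
    Unit("AA", "bad_news", 100.0),
]

best_expressive = pick_unit(inventory, "AA", 125.0, "good_news")
best_neutral = pick_unit(inventory, "AA", 125.0, "neutral")
```

Because the mismatch is a soft penalty rather than a hard filter, a neutral segment can still be chosen when no close expressive match exists, which fits a setup where the neutral corpus (11 h) is much larger than each expressive corpus (1 h).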