Unit selection in a concatenative speech synthesis system using a large speech database

Authors:
A. J. Hunt; A. W. Black
Affiliations:
ATR Interpreting Telecommun. Res. Labs., Kyoto, Japan; Dept. of Electron. Eng., Chinese Univ. of Hong Kong, Shatin, Hong Kong
Venue:
ICASSP '96 Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing - Volume 01
Year:
1996

Citing 0
Cited 80

Interactive Speech Translation in the Diplomat Project

Machine Translation
A comparison of spectral smoothing methods for segment concatenation based speech synthesis

Speech Communication
A segmental speech coder based on a concatenative TTS

Speech Communication
Algebraic Models of Speech Segment Databases

TSD '01 Proceedings of the 4th International Conference on Text, Speech and Dialogue
Phonetic alignment: speech synthesis-based vs. viterbi-based

Speech Communication
Speaking with hands: creating animated conversational characters from recordings of human performance

ACM SIGGRAPH 2004 Papers
Accurate Visible Speech Synthesis Based on Concatenating Variable Length Motion Capture Data

IEEE Transactions on Visualization and Computer Graphics
Statistical modeling for unit selection in speech synthesis

ACL '04 Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics
Learning to say it well: reranking realizations by predicted synthesis quality

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Multisyn: Open-domain unit selection for the Festival speech synthesis system

Speech Communication
Adaptive Concatenative Sound Synthesis and Its Application to Micromontage Composition

Computer Music Journal
The listening room: a speech-based interactive art installation

Proceedings of the 15th international conference on Multimedia
Acoustic speech unit segmentation for concatenative synthesis

Computer Speech and Language
Statistical mapping between articulatory movements and acoustic spectrum using a Gaussian mixture model

Speech Communication
Introduction to digital speech processing

Foundations and Trends in Signal Processing
A Romanian syllable-based text-to-speech system

ISPRA'07 Proceedings of the 6th WSEAS International Conference on Signal Processing, Robotics and Automation
Diphone Databases for Lithuanian Text-to-Speech Synthesis

Informatica
Boundary Refining Aiming at Speech Synthesis Applications

PROPOR '08 Proceedings of the 8th international conference on Computational Processing of the Portuguese Language
Evolutionary-Based Design of a Brazilian Portuguese Recording Script for a Concatenative Synthesis System

PROPOR '08 Proceedings of the 8th international conference on Computational Processing of the Portuguese Language
IDEAS4Games: Building Expressive Virtual Characters for Computer Games

IVA '08 Proceedings of the 8th international conference on Intelligent Virtual Agents
Multimodal Unit Selection for 2D Audiovisual Text-to-Speech Synthesis

MLMI '08 Proceedings of the 5th international workshop on Machine Learning for Multimodal Interaction
Enhancing Animated Agents in an Instrumented Poker Game

KI '08 Proceedings of the 31st annual German conference on Advances in Artificial Intelligence
HMM-Based Speech Synthesis for the Greek Language

TSD '08 Proceedings of the 11th international conference on Text, Speech and Dialogue
Post-recording tool for instant casting movie system

MM '08 Proceedings of the 16th ACM international conference on Multimedia
Integrating phrasing and intonation modelling using syntactic and morphosyntactic information

Speech Communication
A Speech Parameter Generation Algorithm Considering Global Variance for HMM-Based Speech Synthesis

IEICE - Transactions on Information and Systems
Regionalized Text-to-Speech Systems: Persona Design and Application Scenarios

Multimodal Signals: Cognitive and Algorithmic Issues
Developing a reading tutor: Design and evaluation of dedicated speech recognition and synthesis modules

Speech Communication
Review: Statistical parametric speech synthesis

Speech Communication
Design of the Test Stimuli for the Evaluation of Concatenation Cost Functions

TSD '09 Proceedings of the 12th International Conference on Text, Speech and Dialogue
On the importance of audiovisual coherence for the perceived quality of synthesized visual speech

EURASIP Journal on Audio, Speech, and Music Processing - Special issue on animating virtual speakers or singers from audio: Lip-synching facial animation
Optimization of an image-based talking head system

EURASIP Journal on Audio, Speech, and Music Processing - Special issue on animating virtual speakers or singers from audio: Lip-synching facial animation
Emphatic visual speech synthesis

IEEE Transactions on Audio, Speech, and Language Processing - Special issue on multimodal processing in speech-based interactions
Robust speaker-adaptive HMM-based text-to-speech synthesis

IEEE Transactions on Audio, Speech, and Language Processing
Expressive concatenative synthesis by reusing samples from real performance recordings

Computer Music Journal
Unit selection using k-nearest neighbor search for concatenative speech synthesis

Proceedings of the 3rd International Universal Communication Symposium
Enhancing Accessibility of Web Content for the Print-Impaired and Blind People

USAB '09 Proceedings of the 5th Symposium of the Workgroup Human-Computer Interaction and Usability Engineering of the Austrian Computer Society on HCI and Usability for e-Inclusion
Implementation of Three Text to Speech Systems for Kurdish Language

CIARP '09 Proceedings of the 14th Iberoamerican Conference on Pattern Recognition: Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications
Unit Selection Using Linguistic, Prosodic and Spectral Distance for Developing Text-to-Speech System in Hindi

PReMI '09 Proceedings of the 3rd International Conference on Pattern Recognition and Machine Intelligence
Analysis of statistical parametric and unit selection speech synthesis systems applied to emotional speech

Speech Communication
On the detection of discontinuities in concatenative speech synthesis

Progress in nonlinear speech processing
Extracting user preferences by GTM for aiGA weight tuning in unit selection text-to-speech synthesis

IWANN'07 Proceedings of the 9th international work conference on Artificial neural networks
Enrich web applications with voice internet persona text-to-speech for anyone, anywhere

HCI'07 Proceedings of the 12th international conference on Human-computer interaction: intelligent multimodal interaction environments
Emotion conversion based on prosodic unit selection

IEEE Transactions on Audio, Speech, and Language Processing
A dynamic cost weighting framework for unit selection text-to-speech synthesis

IEEE Transactions on Audio, Speech, and Language Processing
User preferences can drive facial expressions: evaluating an embodied conversational agent in a recommender dialogue system

User Modeling and User-Adapted Interaction
The user model-based summarize and refine approach improves information presentation in spoken dialog systems

Computer Speech and Language
Exploiting prosody hierarchy and dynamic features for pitch modeling and generation in HMM-based speech synthesis

IEEE Transactions on Audio, Speech, and Language Processing
Photorealistic 2D audiovisual text-to-speech synthesis using active appearance models

Proceedings of the SSPNET 2nd International Symposium on Facial Analysis and Animation
The Romanian speech synthesis (RSS) corpus: Building a high quality HMM-based speech synthesis system using a high sampling rate

Speech Communication
Efficient and reliable perceptual weight tuning for unit-selection text-to-speech synthesis based on active interactive genetic algorithms: A proof-of-concept

Speech Communication
Performance: what does a body know?

CHI '11 Extended Abstracts on Human Factors in Computing Systems
Corpus design for a unit selection TtS system with application to Bulgarian

LTC'09 Proceedings of the 4th conference on Human language technology: challenges for computer science and linguistics
Two methods for assessing oral reading prosody

ACM Transactions on Speech and Language Processing (TSLP)
A review of personality in voice-based man machine interaction

HCII'11 Proceedings of the 14th international conference on Human-computer interaction: interaction techniques and environments - Volume Part II
Personalized voice assignment techniques for synchronized scenario speech output in entertainment systems

Proceedings of the 2011 international conference on Virtual and mixed reality: systems and applications - Volume Part II
Development of syllable-based text to speech synthesis system in Bengali

International Journal of Speech Technology
Analysis of data collected in listening tests for the purpose of evaluation of concatenation cost functions

TSD'11 Proceedings of the 14th international conference on Text, speech and dialogue
Identifying concatenation discontinuities by hierarchical divisive clustering of pitch contours

TSD'11 Proceedings of the 14th international conference on Text, speech and dialogue
A tone-modeling technique using a quantized F0 context to improve tone correctness in average-voice-based speech synthesis

Speech Communication
A phonetic analysis of natural laughter, for use in automatic laughter processing systems

ACII'11 Proceedings of the 4th international conference on Affective computing and intelligent interaction - Volume Part I
Oscillating statistical moments for speech polarity detection

NOLISP'11 Proceedings of the 5th international conference on Advances in nonlinear speech processing
The effects of windowing on the calculation of MFCCs for different types of speech sounds

NOLISP'11 Proceedings of the 5th international conference on Advances in nonlinear speech processing
Dynamic mapping method based speech driven face animation system

ACII'05 Proceedings of the First international conference on Affective Computing and Intelligent Interaction
Affective computing: a review

ACII'05 Proceedings of the First international conference on Affective Computing and Intelligent Interaction
A new spectral smoothing algorithm for unit concatenating speech synthesis

AI'05 Proceedings of the 18th Australian Joint conference on Advances in Artificial Intelligence
Selecting prosody parameters for unit selection based Chinese TTS

IJCNLP'04 Proceedings of the First international joint conference on Natural Language Processing
Application of Genetic Algorithm in unit selection for Malay speech synthesis system

Expert Systems with Applications: An International Journal
Nonlinear speech features for the objective detection of discontinuities in concatenative speech synthesis

Nonlinear Speech Modeling and Applications
Motion-driven concatenative synthesis of cloth sounds

ACM Transactions on Graphics (TOG) - SIGGRAPH 2012 Conference Proceedings
Evaluation of TTS systems in intelligibility and comprehension tasks

ROCLING '11 Proceedings of the 23rd Conference on Computational Linguistics and Speech Processing
Syllable Specific Unit Selection Cost Functions for Text-to-Speech Synthesis

ACM Transactions on Speech and Language Processing (TSLP)
Optimal weight tuning method for unit selection cost functions in syllable based text-to-speech synthesis

Applied Soft Computing
Expressive speech synthesis: a review

International Journal of Speech Technology
Comprehensive many-to-many phoneme-to-viseme mapping and its application for concatenative visual speech synthesis

Speech Communication
Synthesis and perception of breathy, normal, and Lombard speech in the presence of noise

Computer Speech and Language
Synthesis of Spontaneous Speech With Syllable Contraction Using State-Based Context-Dependent Voice Transformation

IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP)
Speech polarity determination: A comparative evaluation

Neurocomputing

Abstract

One approach to the generation of natural-sounding synthesized speech waveforms is to select and concatenate units from a large speech database. Units (in the current work, phonemes) are selected to produce a natural realisation of a target phoneme sequence predicted from text which is annotated with prosodic and phonetic context information. We propose that the units in a synthesis database can be considered as a state transition network in which the state occupancy cost is the distance between a database unit and a target, and the transition cost is an estimate of the quality of concatenation of two consecutive units. This framework has many similarities to HMM-based speech recognition. A pruned Viterbi search is used to select the best units for synthesis from the database. This approach to waveform synthesis permits training from natural speech: two methods for training from speech are presented which provide weights which produce more natural speech than can be obtained by hand-tuning.
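
The search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_units`, the `beam_width` parameter, and the toy pitch-based cost functions are assumptions made for illustration, and the abstract does not specify how the costs are computed or how pruning is done. Each target position holds a column of candidate database units; a hypothesis pays a state occupancy (target) cost for each unit and a transition (concatenation) cost for each join, and a simple beam keeps only the cheapest hypotheses per column.

```python
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")  # target specification (hypothetical type)
U = TypeVar("U")  # database unit (hypothetical type)


def select_units(
    targets: Sequence[T],
    candidates: Sequence[Sequence[U]],
    target_cost: Callable[[T, U], float],
    join_cost: Callable[[U, U], float],
    beam_width: Optional[int] = None,
) -> Tuple[float, List[U]]:
    """Viterbi search over candidate units with optional beam pruning.

    candidates[t] lists the database units that may realise targets[t].
    Returns the lowest total cost found and the corresponding unit sequence.
    """
    if len(targets) != len(candidates):
        raise ValueError("need one candidate list per target")
    if not targets:
        return 0.0, []

    # Each hypothesis is (accumulated cost, index into previous column, unit).
    columns: List[List[Tuple[float, int, U]]] = []
    for t, (target, units) in enumerate(zip(targets, candidates)):
        if not units:
            raise ValueError(f"no candidate units for target {t}")
        column: List[Tuple[float, int, U]] = []
        for unit in units:
            occupancy = target_cost(target, unit)
            if t == 0:
                column.append((occupancy, -1, unit))
                continue
            # Cheapest way to reach this unit from the surviving predecessors.
            best, back = min(
                (prev_cost + join_cost(prev_unit, unit), i)
                for i, (prev_cost, _, prev_unit) in enumerate(columns[-1])
            )
            column.append((best + occupancy, back, unit))
        # Sort by cost so the best hypothesis is first, then prune to the beam.
        column.sort(key=lambda h: h[0])
        columns.append(column if beam_width is None else column[:beam_width])

    # Trace back from the cheapest hypothesis in the final column.
    total = columns[-1][0][0]
    path: List[U] = []
    idx = 0
    for column in reversed(columns):
        _, back, unit = column[idx]
        path.append(unit)
        idx = back
    path.reverse()
    return total, path


# Toy example: units are (phoneme, pitch in Hz); both costs are pitch differences.
toy_targets = [("a", 120.0), ("b", 120.0)]
toy_candidates = [
    [("a", 100.0), ("a", 130.0)],
    [("b", 90.0), ("b", 128.0)],
]
toy_total, toy_path = select_units(
    toy_targets,
    toy_candidates,
    target_cost=lambda tgt, unit: abs(tgt[1] - unit[1]),
    join_cost=lambda left, right: abs(left[1] - right[1]),
    beam_width=2,
)
print(toy_total, toy_path)
```

In this toy setup the search weighs how well each unit matches its target against how smoothly adjacent units join. In a real system both costs would be weighted combinations of several features, with the weights either hand-tuned or trained from natural speech as the abstract describes.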