Production of filled pauses in concatenative speech synthesis based on the underlying fluent sentence

Authors:
Jordi Adell;David Escudero;Antonio Bonafonte
Affiliations:
Universitat Politècnica de Catalunya, Barcelona, Spain;Universidad de Valladolid, Valladolid, Spain;Universitat Politècnica de Catalunya, Barcelona, Spain
Venue:
Speech Communication
Year:
2012

Abstract

Until now, speech synthesis has mainly involved reading-style speech. Today, however, text-to-speech systems must provide a variety of styles because users expect these interfaces to do more than just read information. If synthetic voices are to be integrated into future technology, they must simulate the way people talk rather than the way people read. Existing knowledge about how disfluencies occur has made it possible to propose a general framework for synthesising disfluencies. We propose a model based on the definition of disfluency and the concept of underlying fluent sentences. The model combines the parameters of standard prosodic models for fluent speech with local modifications of the prosodic parameters near the interruption point. The constituents of the local models for filled pauses are derived from the analysis corpus, and the constituents' prosodic parameters are predicted via linear regression analysis. We also discuss the implementation details of the model when used in a real speech synthesis system. Objective and perceptual evaluations showed that the proposed models outperformed the baseline model. Perceptual evaluations of the system showed that it is possible to synthesise filled pauses without decreasing the overall naturalness of the system, and users stated that the resulting speech is even more natural than speech produced without filled pauses.
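
The idea of starting from the prosody of the underlying fluent sentence and predicting a filled pause's parameters by linear regression can be illustrated with a minimal sketch. This is not the authors' implementation: the single predictor (duration of the syllable before the interruption point), the function names `fit_line` and `insert_filled_pause`, and all numeric values are assumptions made up for illustration, and the sketch handles duration only, with none of the other local prosodic modifications the abstract mentions.

```python
from statistics import mean


def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must contain at least two distinct values")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def insert_filled_pause(fluent_durations, ip_index, slope, intercept):
    """Return a copy of the fluent duration contour with a filled pause
    inserted right after position ip_index (the assumed interruption point).

    The filled-pause duration is predicted from the duration of the unit
    at ip_index; the fluent units themselves are left unchanged.
    """
    context = fluent_durations[ip_index]
    fp_duration = slope * context + intercept
    return (
        fluent_durations[: ip_index + 1]
        + [fp_duration]
        + fluent_durations[ip_index + 1:]
    )


# Toy training pairs in milliseconds (invented, not taken from the paper):
# duration of the syllable before the interruption point -> filled-pause duration.
train_context = [100.0, 150.0, 200.0, 250.0]
train_fp = [250.0, 325.0, 400.0, 475.0]

slope, intercept = fit_line(train_context, train_fp)

# Durations predicted by a fluent-speech prosodic model (also invented).
fluent = [120.0, 180.0, 160.0]
with_fp = insert_filled_pause(fluent, 1, slope, intercept)
print(with_fp)
```

Keeping the fluent contour untouched and adding the filled pause as a local insertion mirrors the separation the abstract describes between the standard fluent-speech model and the local model around the interruption point. A real system would likely use several predictors and also model pitch and the lengthening of neighbouring units.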