Listeners' weighting of acoustic cues to synthetic speech naturalness: A multidimensional scaling analysis

Authors:
Catherine Mayo;Robert A. J. Clark;Simon King
Affiliations:
Centre for Speech Technology Research, University of Edinburgh, Informatics Forum, 10 Crichton Street, Edinburgh EH8 9AB, United Kingdom;Centre for Speech Technology Research, University of Edinburgh, Informatics Forum, 10 Crichton Street, Edinburgh EH8 9AB, United Kingdom;Centre for Speech Technology Research, University of Edinburgh, Informatics Forum, 10 Crichton Street, Edinburgh EH8 9AB, United Kingdom
Venue:
Speech Communication
Year:
2011

Abstract

The quality of current commercial speech synthesis systems is now so high that system improvements are being made at subtle sub- and supra-segmental levels. Human perceptual evaluation of such subtle improvements requires a highly sophisticated level of perceptual attention to specific acoustic characteristics or cues. However, it is not well understood which acoustic cues listeners attend to by default when asked to evaluate synthetic speech. It may therefore be quite difficult to design an evaluation method that allows listeners to concentrate on only one dimension of the signal while ignoring others that are perceptually more important to them. The aim of the current study was to determine which acoustic characteristics of unit-selection synthetic speech are most salient to listeners when evaluating the naturalness of such speech. This study used multidimensional scaling techniques to analyse listeners' pairwise comparisons of synthetic speech sentences. Results indicate that listeners place a great deal of perceptual importance on the presence of artifacts and discontinuities in the speech, somewhat less importance on aspects of segmental quality, and very little importance on stress/intonation appropriateness. These relative differences in importance will affect listeners' ability to attend to these different acoustic characteristics of synthetic speech, and should therefore be taken into account when designing methods of synthetic speech evaluation.
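
As a rough illustration of how multidimensional scaling turns pairwise comparisons into a perceptual space, the sketch below applies classical (Torgerson) MDS to a small dissimilarity matrix. The abstract does not say which MDS variant the study used, so classical MDS is an assumption here. The function name `classical_mds` and the four-stimulus toy data are hypothetical and are not taken from the paper.

```python
import numpy as np


def classical_mds(dissimilarity, n_dims=2):
    """Embed items in n_dims dimensions from a symmetric dissimilarity matrix.

    Classical (Torgerson) MDS; an assumed stand-in, not the paper's method.
    """
    d = np.asarray(dissimilarity, dtype=float)
    n = d.shape[0]
    # Double-centre the squared dissimilarities to get a Gram-like matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (d ** 2) @ centering
    # Eigendecomposition of the symmetric matrix; keep the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:n_dims]
    # Clip tiny negative eigenvalues caused by noise or rounding.
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


# Hypothetical toy data: four stimuli placed in an invented 2-D space,
# with pairwise dissimilarities taken as the distances between them.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
toy_dissim = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

coords = classical_mds(toy_dissim, n_dims=2)
# Distances between the recovered coordinates, for comparison with the input.
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

In a listening test, the matrix would hold averaged listener dissimilarity judgements for each pair of synthetic sentences. The recovered axes would then be interpreted afterwards, for example by relating them to artifacts, segmental quality, or intonation.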