Unit selection in a concatenative speech synthesis system using a large speech database
ICASSP '96: Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing - Volume 01
Perceptual and objective detection of discontinuities in concatenative speech synthesis
ICASSP '01: Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing - Volume 02
Unit selection speech synthesis involves concatenating segments of speech contained in a large database in such a way as to create novel utterances. The sequence of speech segments is chosen using a cost function. In particular, the join cost determines how well consecutive speech segments fit together by extracting acoustic parameters from frames of speech on either side of a potential join point and calculating the distance between them. Although many different metrics have been proposed, there is very little agreement on what constitutes an appropriate window length, with values in the literature ranging from 5 ms to 30 ms. Clearly, the performance of different metrics cannot be compared when the role of a parameter as fundamental as window length has not been properly investigated with real speech signals. Here we address this shortcoming by focusing on one of the most common metrics, the mel-frequency cepstral coefficient (MFCC) [1], and show with experimental results that the choice of window length has a direct impact on the MFCC values calculated, and that the ability of the distance measure to predict discontinuity differs with respect to both the width of the windowing function and whether the sounds are vowels, voiceless fricatives or voiced fricatives.
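To make the dependence on window length concrete, the following is a minimal sketch of an MFCC-based join cost evaluated with a configurable window. It is an illustration under stated assumptions, not the configuration used in the paper: the Hamming window, the 20-filter mel bank, the 12 coefficients (excluding c0), the 1024-point FFT, the Euclidean distance, and the function names `mfcc_frame` and `join_cost` are all choices made here for the example.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale (assumed design)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    hz_pts = mel_to_hz(mel_pts)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    fb = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, mid, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def mfcc_frame(frame, sr, n_filters=20, n_ceps=12, n_fft=1024):
    """MFCCs of one frame; the frame length is the analysis window length.

    Assumes len(frame) <= n_fft so the frame is zero-padded, not truncated.
    """
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed, n=n_fft)) ** 2
    log_energy = np.log(mel_filterbank(n_filters, n_fft, sr) @ power + 1e-10)
    # DCT-II of the log filterbank energies, keeping c1..c_n_ceps.
    n = np.arange(n_filters)
    k = np.arange(1, n_ceps + 1)[:, None]
    dct = np.cos(np.pi * k * (n + 0.5) / n_filters)
    return dct @ log_energy

def join_cost(left, right, sr, window_ms):
    """Euclidean MFCC distance between the frames either side of a join.

    `left` ends at the join point and `right` starts at it; both must
    contain at least one window's worth of samples.
    """
    n = int(round(sr * window_ms / 1000.0))
    a = mfcc_frame(left[-n:], sr)
    b = mfcc_frame(right[:n], sr)
    return float(np.linalg.norm(a - b))

if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr // 10) / sr
    # Two synthetic stand-ins for speech units (not real speech).
    left = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
    right = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 2400 * t)
    for window_ms in (5, 10, 20, 30):
        print(window_ms, "ms:", join_cost(left, right, sr, window_ms))
```

Running the script prints the cost for the same candidate join at window lengths spanning the 5 ms to 30 ms range mentioned above. The synthetic tones only show that the window length enters the computation; they say nothing about perceived discontinuity, which is what the experiments with real speech address.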