The effects of windowing on the calculation of MFCCs for different types of speech sounds

  • Authors:
  • Amelia C. Kelly;Christer Gobl

  • Affiliations:
  • Phonetics and Speech Laboratory, Centre for Language and Communication Studies, SLSCS, Trinity College, Dublin, Ireland;Phonetics and Speech Laboratory, Centre for Language and Communication Studies, SLSCS, Trinity College, Dublin, Ireland

  • Venue:
  • NOLISP'11 Proceedings of the 5th international conference on Advances in nonlinear speech processing
  • Year:
  • 2011

Abstract

Unit selection speech synthesis creates novel utterances by concatenating segments of speech drawn from a large database. The sequence of speech segments is chosen using a cost function. In particular, the join cost determines how well consecutive speech segments fit together by extracting acoustic parameters from frames of speech on either side of a potential join point and calculating the distance between them. Although many different metrics have been proposed, there is very little agreement on what constitutes an appropriate window length, with values in the literature ranging from 5 ms to 30 ms. Clearly, it is not possible to compare the performance of different metrics when the role of a parameter as fundamental as window length has not been properly investigated with real speech signals. Here we address this shortcoming by focusing on one of the most common metrics, the mel-frequency cepstral coefficient (MFCC) [1], and show with experimental results that the choice of window length has a direct impact on the MFCC values calculated, and that the ability of the distance measure to predict discontinuity differs with respect to both the width of the windowing function and whether the sounds are vowels, voiceless fricatives, or voiced fricatives.
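The join cost described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Hamming window, the 20 mel filters, the 12 coefficients, the Euclidean distance, the 16 kHz sampling rate, and the synthetic two-tone test signal are all assumptions chosen for illustration, and every function name is hypothetical. The sketch computes an MFCC vector for the frame on each side of a join point and reports the distance between the two vectors for several window lengths in the 5 ms to 30 ms range.

```python
import math

def hamming(n):
    """Hamming window of n samples (assumed window shape)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..n/2 (slow but dependency-free)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        re = im = 0.0
        for i, x in enumerate(frame):
            angle = 2 * math.pi * k * i / n
            re += x * math.cos(angle)
            im -= x * math.sin(angle)
        spec.append((re * re + im * im) / n)
    return spec

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10 ** (m / 2595.0) - 1.0)

def mel_energies(spec, n, sr, n_filters):
    """Triangular filters spaced evenly on the mel scale from 0 Hz to sr/2."""
    mel_max = hz_to_mel(sr / 2)
    edges = [mel_to_hz(mel_max * j / (n_filters + 1)) for j in range(n_filters + 2)]
    energies = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        total = 0.0
        for k, p in enumerate(spec):
            f = k * sr / n  # centre frequency of DFT bin k in Hz
            if lo < f <= mid:
                total += p * (f - lo) / (mid - lo)
            elif mid < f < hi:
                total += p * (hi - f) / (hi - mid)
        energies.append(total)
    return energies

def mfcc(frame, sr, n_filters=20, n_coeffs=12):
    """MFCCs 1..n_coeffs of one frame (coefficient 0 is dropped)."""
    n = len(frame)
    windowed = [x * w for x, w in zip(frame, hamming(n))]
    energies = mel_energies(power_spectrum(windowed), n, sr, n_filters)
    # The floor avoids log(0) when a short window leaves a filter without any DFT bin.
    log_e = [math.log(e + 1e-10) for e in energies]
    return [
        sum(v * math.cos(math.pi * m * (j + 0.5) / n_filters) for j, v in enumerate(log_e))
        for m in range(1, n_coeffs + 1)
    ]

def join_distance(signal, join_idx, sr, win_ms):
    """Euclidean distance between MFCCs of the frames on either side of join_idx."""
    n = int(sr * win_ms / 1000)
    left = mfcc(signal[join_idx - n:join_idx], sr)
    right = mfcc(signal[join_idx:join_idx + n], sr)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(left, right)))

def tones(freqs, n, sr, start=0):
    """Sum of unit-amplitude sinusoids, n samples long, beginning at sample index start."""
    return [sum(math.sin(2 * math.pi * f * (start + i) / sr) for f in freqs) for i in range(n)]

SR = 16000  # assumed sampling rate in Hz
# Synthetic join: the upper component jumps from 1200 Hz to 2500 Hz at sample 1000.
signal = tones([200, 1200], 1000, SR) + tones([200, 2500], 1000, SR, start=1000)

for win_ms in (5, 10, 20, 30):
    print(win_ms, "ms:", round(join_distance(signal, 1000, SR, win_ms), 3))
```

At 5 ms the frame holds only 80 samples at this sampling rate, so the DFT bins are 200 Hz apart and the lowest mel filters may contain no bin at all. This is one concrete way in which the window length changes the MFCC values, independent of the speech itself.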