Prediction of Fundamental Frequency and Voicing From Mel-Frequency Cepstral Coefficients for Unconstrained Speech Reconstruction

Authors:
Ben Milner;Xu Shao
Affiliations:
Sch. of Comput. Sci., East Anglia Univ., Norwich;-
Venue:
IEEE Transactions on Audio, Speech, and Language Processing
Year:
2007

Citing 0
Cited 3

Feature selection using singular value decomposition and QR factorization with column pivoting for text-independent speaker identification
Speech Communication

Speech recognition on mobile devices
Mobile Multimedia Processing

Alaryngeal Speech Enhancement Based on One-to-Many Eigenvoice Conversion
IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP)

Abstract

This work proposes a method for predicting the fundamental frequency and voicing of a frame of speech from its mel-frequency cepstral coefficient (MFCC) vector representation. This information is subsequently used to enable a speech signal to be reconstructed solely from a stream of MFCC vectors and has particular application in distributed speech recognition systems. Prediction is achieved by modeling the joint density of fundamental frequency and MFCCs. This density is first modeled using a Gaussian mixture model (GMM) and then extended by using a set of hidden Markov models to link together a series of state-dependent GMMs. Prediction accuracy is measured on unconstrained speech input for both a speaker-dependent system and a speaker-independent system. A fundamental frequency prediction error of 3.06% is obtained on the speaker-dependent system, compared with 8.27% on the speaker-independent system. On the speaker-dependent system, 5.22% of frames have voicing errors, compared with 8.82% on the speaker-independent system. Spectrogram analysis of reconstructed speech shows that highly intelligible speech is produced, with the quality of the speaker-dependent speech being slightly higher owing to the more accurate fundamental frequency and voicing predictions.
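
As an illustration of predicting fundamental frequency from a joint density, the sketch below computes the conditional mean of F0 given an MFCC vector under a GMM fitted to joint vectors [MFCC; F0]. This is the standard GMM regression estimator and is only assumed here; the abstract does not state which estimator the authors use. The function names (`log_gaussian`, `predict_f0`) and the parameter layout (F0 stored as the last dimension of each mean and covariance) are hypothetical, and the HMM extension and voicing decision are not covered.

```python
import numpy as np


def log_gaussian(x, mean, cov):
    """Log density of a multivariate Gaussian N(mean, cov) evaluated at x."""
    d = x.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)


def predict_f0(x, weights, means, covs):
    """Conditional-mean estimate of F0 given an MFCC vector x.

    Assumed layout (hypothetical): each component mean has length d + 1 and
    each covariance is (d + 1) x (d + 1), with F0 as the last dimension.
    """
    d = x.shape[0]
    n_components = len(weights)
    log_post = np.empty(n_components)
    cond_means = np.empty(n_components)
    for k in range(n_components):
        mu_x, mu_f = means[k][:d], means[k][d]
        cov_xx = covs[k][:d, :d]   # MFCC-MFCC block
        cov_fx = covs[k][d, :d]    # F0-MFCC cross-covariance row
        # Unnormalized log posterior of component k given x
        log_post[k] = np.log(weights[k]) + log_gaussian(x, mu_x, cov_xx)
        # Per-component linear regression of F0 on x
        cond_means[k] = mu_f + cov_fx @ np.linalg.solve(cov_xx, x - mu_x)
    # Normalize posteriors in a numerically stable way
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    return float(post @ cond_means)
```

With a single component, the estimate reduces to ordinary linear regression of F0 on the MFCC vector. With several components, the per-component regressions are blended by the posterior probability of each component given the observed MFCCs.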