Improvement to a NAM-captured whisper-to-speech system

  • Authors:
  • Viet-Anh Tran; Gérard Bailly; Hélène Lœvenbruck; Tomoki Toda

  • Affiliations:
  • GIPSA-lab, UMR 5216 CNRS/Grenoble Universities, France (V.-A. Tran, G. Bailly, H. Lœvenbruck); Graduate School of Information Science, Nara Institute of Science and Technology, Japan (T. Toda)

  • Venue:
  • Speech Communication
  • Year:
  • 2010

Abstract

The system developed at NAIST, which exploits a tissue-conductive sensor (a stethoscopic microphone) to convert non-audible murmur (NAM) into audible speech by GMM-based statistical mapping, is a very promising technique. The quality of the converted speech is, however, still insufficient for computer-mediated communication, notably because of the poor estimation of F0 from unvoiced speech and because of impoverished phonetic contrasts. This paper presents our investigations to improve the intelligibility and naturalness of the synthesized speech, together with the first objective and subjective evaluations of the resulting system. The first improvement concerns voicing and F0 estimation. Instead of using a single GMM for both, we estimate a continuous F0 with a GMM trained on target voiced segments only. The continuous F0 estimate is then filtered by a voicing decision computed by a neural network. The objective and subjective improvement is significant. The second improvement concerns the input time window and its dimensionality reduction: we show that the precision of F0 estimation is also significantly improved by extending the input time window from 90 to 450 ms and by using Linear Discriminant Analysis (LDA) instead of the original Principal Component Analysis (PCA). Estimation of the spectral envelope is also slightly improved with LDA but is degraded with larger time windows. A third improvement consists in adding visual parameters as both input and output parameters. The positive contribution of this information is confirmed by a subjective test. Finally, HMM-based conversion is compared with GMM-based conversion.
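
As a concrete illustration of the decomposition described in the abstract (a GMM trained on voiced frames for continuous F0, plus a separate neural-network voicing classifier that gates its output), the sketch below shows one possible joint-density GMM regression pipeline. It is not the authors' implementation: the feature dimensions, model sizes, and synthetic data are assumptions made purely for illustration.

```python
"""Illustrative sketch (not the authors' code) of the two-step F0 strategy:
a joint-density GMM maps NAM features to a continuous F0 trajectory, and a
separate neural-network classifier supplies the voiced/unvoiced decision."""

import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture
from sklearn.neural_network import MLPClassifier


def fit_joint_gmm(X_voiced, f0_voiced, n_components=16):
    """Fit a GMM on joint [NAM features, F0] vectors, voiced frames only."""
    Z = np.hstack([X_voiced, f0_voiced[:, None]])
    return GaussianMixture(n_components=n_components, covariance_type="full",
                           random_state=0).fit(Z)


def gmm_regress_f0(gmm, X, dim_x):
    """Minimum mean-square-error mapping E[F0 | x] under the joint GMM."""
    n = len(X)
    resp = np.zeros((n, gmm.n_components))   # component responsibilities
    cond = np.zeros((n, gmm.n_components))   # per-component conditional means
    for k in range(gmm.n_components):
        mu, cov = gmm.means_[k], gmm.covariances_[k]
        mu_x, mu_y = mu[:dim_x], mu[dim_x:]
        S_xx, S_yx = cov[:dim_x, :dim_x], cov[dim_x:, :dim_x]
        resp[:, k] = gmm.weights_[k] * multivariate_normal(mu_x, S_xx).pdf(X)
        cond[:, k] = (mu_y + (X - mu_x) @ np.linalg.solve(S_xx, S_yx.T)).ravel()
    resp /= resp.sum(axis=1, keepdims=True)
    return (resp * cond).sum(axis=1)


# ---- usage on synthetic stand-in data (all shapes are illustrative) ----
rng = np.random.default_rng(0)
dim_x = 24                                       # stand-in for stacked NAM features
X_train = rng.normal(size=(2000, dim_x))
voiced = rng.random(2000) > 0.4                  # stand-in voicing labels
f0_train = 5.0 + 0.2 * X_train[:, 0] + 0.05 * rng.normal(size=2000)  # stand-in log-F0

gmm = fit_joint_gmm(X_train[voiced], f0_train[voiced])        # F0 model, voiced only
vuv = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                    random_state=0).fit(X_train, voiced)       # voicing decision

X_test = rng.normal(size=(100, dim_x))
f0_cont = gmm_regress_f0(gmm, X_test, dim_x)                   # continuous F0
f0_out = np.where(vuv.predict(X_test), f0_cont, 0.0)           # 0 marks unvoiced frames
```

The point illustrated here is the separation of concerns: the F0 regressor only ever models voiced frames, while the voicing decision is delegated to a dedicated classifier that gates the continuous trajectory.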
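
The window-length and dimensionality-reduction comparison can be sketched in the same hedged spirit: input frames are stacked over a context window and then reduced either with PCA (unsupervised) or with LDA (supervised by phonetic class labels). The frame shift, window sizes, feature dimension and labels below are placeholders, not the paper's actual configuration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def stack_frames(frames, context):
    """Concatenate each frame with its +/- `context` neighbours.
    Assuming a 10 ms frame shift, context=4 spans about 90 ms and
    context=22 about 450 ms (the window sizes compared in the paper)."""
    n = len(frames)
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + n] for i in range(2 * context + 1)])


rng = np.random.default_rng(0)
frames = rng.normal(size=(1000, 20))        # stand-in NAM spectral frames
labels = rng.integers(0, 30, size=1000)     # stand-in phone-class labels

X = stack_frames(frames, context=22)        # ~450 ms context window

x_pca = PCA(n_components=40).fit_transform(X)                                  # unsupervised
x_lda = LinearDiscriminantAnalysis(n_components=29).fit_transform(X, labels)   # supervised
```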