This article presents a segmental vocoder driven by ultrasound and optical images (standard CCD camera) of the tongue and lips for a "silent speech interface" application, usable either by a laryngectomized patient or for silent communication. The system is built around an audio-visual dictionary that associates visual observations with acoustic observations for each phonetic class. Visual features are extracted from ultrasound images of the tongue and from video images of the lips using a PCA-based image coding technique. Visual observations of each phonetic class are modeled by continuous HMMs. The system then combines a phone recognition stage with corpus-based synthesis. In the recognition stage, the visual HMMs are used to identify phonetic targets in a sequence of visual features. In the synthesis stage, these phonetic targets constrain the dictionary search for the sequence of diphones that maximizes similarity to the input test data in the visual space, subject to a concatenation cost in the acoustic domain. A prosody template is extracted from the training corpus, and the final speech waveform is generated using "Harmonic plus Noise Model" concatenative synthesis techniques. Experimental results are based on an audiovisual database containing one hour of continuous speech from each of two speakers.
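As a concrete illustration of the PCA-based image coding step, the sketch below (in Python) fits one principal-component coder per modality and concatenates the projected streams into a single visual feature vector per frame. It is a minimal reconstruction under stated assumptions, not the authors' implementation; the frame sizes, the component count, and all function names are hypothetical.

import numpy as np
from sklearn.decomposition import PCA

def fit_visual_coder(frames, n_components=30):
    # Fit a PCA coder on a stack of grayscale frames of shape
    # (n_frames, height, width), e.g. ultrasound images of the tongue
    # or video images of the lips. The component count is illustrative.
    X = frames.reshape(len(frames), -1).astype(np.float64)
    coder = PCA(n_components=n_components)
    coder.fit(X)
    return coder

def encode(coder, frames):
    # Project frames onto the retained principal components, yielding one
    # low-dimensional visual feature vector per frame.
    X = frames.reshape(len(frames), -1).astype(np.float64)
    return coder.transform(X)

# Usage: one coder per modality, feature streams concatenated per frame.
rng = np.random.default_rng(0)
tongue = rng.random((200, 64, 64))  # stand-in for ultrasound frames
lips = rng.random((200, 32, 32))    # stand-in for lip video frames
features = np.hstack([encode(fit_visual_coder(tongue), tongue),
                      encode(fit_visual_coder(lips), lips)])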
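The recognition stage models the visual observations of each phonetic class with a continuous HMM. The fragment below sketches that idea with the third-party hmmlearn library (an assumption; the article does not name a toolkit): one Gaussian-emission HMM is trained per class, and a feature segment is labeled by the highest-scoring model. For brevity this classifies isolated segments; the continuous phone recognition described in the abstract additionally requires a connected decoding graph.

import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_class_hmms(segments_by_class, n_states=3):
    # segments_by_class: phonetic label -> list of (n_frames, n_features)
    # visual feature arrays. The state count is an illustrative choice.
    models = {}
    for label, segs in segments_by_class.items():
        m = GaussianHMM(n_components=n_states, covariance_type="diag")
        m.fit(np.vstack(segs), lengths=[len(s) for s in segs])
        models[label] = m
    return models

def classify(models, segment):
    # Return the phonetic label whose HMM gives the segment the
    # highest log-likelihood.
    return max(models, key=lambda lbl: models[lbl].score(segment))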
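In the synthesis stage, the phonetic targets define a lattice of candidate dictionary units, and the search trades a visual target cost against an acoustic concatenation cost. The dynamic-programming sketch below shows one plausible form of that search; the cost definitions, the weight w, and the unit fields ("vis", "ac_first", "ac_last") are assumptions for illustration, not the article's exact formulation.

import numpy as np

def visual_cost(unit_feats, input_feats):
    # Target cost: distance between the unit's visual trajectory and the
    # matching span of input features (crude truncation here; dynamic
    # time warping would be a natural refinement).
    n = min(len(unit_feats), len(input_feats))
    return float(np.linalg.norm(unit_feats[:n] - input_feats[:n]))

def concat_cost(prev_unit, next_unit):
    # Join cost in the acoustic domain: distance between the last acoustic
    # frame of one unit and the first acoustic frame of the next.
    return float(np.linalg.norm(prev_unit["ac_last"] - next_unit["ac_first"]))

def select_units(targets, dictionary, input_spans, w=1.0):
    # Viterbi search over the candidate lattice: targets is the phonetic
    # sequence from the recognition stage, dictionary maps each label to
    # its candidate units, and input_spans holds the input visual features
    # aligned to each target.
    lattice = [dictionary[t] for t in targets]
    best = [[(visual_cost(u["vis"], input_spans[0]), -1) for u in lattice[0]]]
    for i in range(1, len(lattice)):
        col = []
        for u in lattice[i]:
            c, k = min((best[i - 1][k][0] + w * concat_cost(p, u), k)
                       for k, p in enumerate(lattice[i - 1]))
            col.append((c + visual_cost(u["vis"], input_spans[i]), k))
        best.append(col)
    k = int(np.argmin([c for c, _ in best[-1]]))  # cheapest final state
    path = [k]
    for i in range(len(lattice) - 1, 0, -1):      # follow backpointers
        path.append(best[i][path[-1]][1])
    path.reverse()
    return [lattice[i][j] for i, j in enumerate(path)]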