Audiovisual-to-articulatory inversion

  • Authors:
  • Hedvig Kjellström
  • Olov Engwall

  • Affiliations:
  • Computer Vision and Active Perception Lab, School of Computer Science and Communication, KTH (Royal Institute of Technology), SE-100 44 Stockholm, Sweden
  • Centre for Speech Technology (CTT), School of Computer Science and Communication, KTH (Royal Institute of Technology), SE-100 44 Stockholm, Sweden

  • Venue:
  • Speech Communication
  • Year:
  • 2009


Abstract

It has been shown that acoustic-to-articulatory inversion, i.e. estimation of the articulatory configuration from the corresponding acoustic signal, can be greatly improved by adding visual features extracted from the speaker's face. For the inversion method to be usable in a realistic application, it should be possible to obtain these features from a monocular frontal face video in which the speaker is not required to wear any special markers. In this study, we investigate the importance of visual cues for inversion. Experiments with motion capture data of the face show that important articulatory information can be extracted using only a few face measures that mimic what a video-based method could provide. We also show that the depth cue is not critical for these measures, which means that the relevant information can be extracted from a frontal video. We further present a video-based face feature extraction method that leads to similar improvements in inversion quality. Rather than tracking points on the face, it represents the appearance of the mouth area using independent component images. These findings are important for applications that need a simple audiovisual-to-articulatory inversion technique, e.g. articulatory phonetics training for second language learners or hearing-impaired persons.
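The appearance-based representation mentioned at the end of the abstract (independent component images of the mouth area) can be illustrated with a generic independent component analysis sketch: cropped mouth-region frames are flattened into pixel vectors and decomposed into a small number of component images, whose per-frame activations then serve as low-dimensional visual features. The sketch below is not the paper's implementation; it uses scikit-learn's FastICA on synthetic frames, and the frame size, component count, and preprocessing are illustrative assumptions only.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Hypothetical stand-in for cropped, grayscale mouth-region frames
# (n_frames x height x width); in practice these would come from a
# monocular frontal face video with the mouth area localized per frame.
rng = np.random.default_rng(0)
frames = rng.random((500, 32, 48))

# Flatten each frame into one pixel vector per video frame.
X = frames.reshape(len(frames), -1)

# Learn a small set of independent component images of the mouth area.
# n_components=10 is an arbitrary illustrative choice, not a value
# taken from the paper.
ica = FastICA(n_components=10, random_state=0, max_iter=500)
features = ica.fit_transform(X)          # per-frame visual feature vectors
component_images = ica.components_.reshape(10, 32, 48)

print(features.shape)           # (500, 10): visual cues for the inversion model
print(component_images.shape)   # (10, 32, 48): independent component images
```

In such a setup, the per-frame feature vectors would be concatenated with acoustic features (e.g. MFCCs) as input to whatever regression model performs the audiovisual-to-articulatory mapping.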