Face active appearance modeling and speech acoustic information to recover articulation

  • Authors:
  • Athanassios Katsamanis, George Papandreou, Petros Maragos

  • Affiliations:
  • School of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece (all authors)

  • Venue:
  • IEEE Transactions on Audio, Speech, and Language Processing - Special issue on multimodal processing in speech-based interactions
  • Year:
  • 2009

Abstract

We are interested in recovering aspects of the vocal tract's geometry and dynamics from speech, a problem referred to as speech inversion. Traditional audio-only speech inversion techniques are inherently ill-posed since the same speech acoustics can be produced by multiple articulatory configurations. To alleviate the ill-posedness of the audio-only inversion process, we propose an inversion scheme which also exploits visual information from the speaker's face. The complex audiovisual-to-articulatory mapping is approximated by an adaptive piecewise linear model. Model switching is governed by a Markovian discrete process which captures articulatory dynamic information. Each constituent linear mapping is effectively estimated via canonical correlation analysis. In the described multimodal context, we investigate alternative fusion schemes which allow interaction between the audio and visual modalities at various synchronization levels. For facial analysis, we employ active appearance models (AAMs) and demonstrate fully automatic face tracking and visual feature extraction. Using the AAM features in conjunction with audio features such as Mel frequency cepstral coefficients (MFCCs) or line spectral frequencies (LSFs) leads to effective estimation of the trajectories followed by certain points of interest in the speech production system. We report experiments on the QSMT and MOCHA databases, which contain audio, video, and electromagnetic articulography data recorded in parallel. The results show that exploiting both audio and visual modalities in a multistream hidden Markov model-based scheme clearly improves performance relative to either audio-only or visual-only estimation.
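As a rough illustration of one constituent audiovisual-to-articulatory linear mapping estimated via canonical correlation analysis, the sketch below fits a CCA-based linear predictor from concatenated audio and visual features to articulatory (EMA) trajectories. All array shapes and the random data are hypothetical, scikit-learn and NumPy are assumed, and this is not the authors' implementation; in the paper, multiple such mappings coexist and a hidden Markov chain selects the active regime at each frame, whereas this sketch covers a single regime with simple feature-level fusion.

```python
# Minimal sketch: one piecewise-linear regime of the audiovisual-to-articulatory
# mapping, estimated via canonical correlation analysis (CCA).
# Hypothetical data and dimensions; scikit-learn/NumPy assumed.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)

# Hypothetical per-frame features for T frames:
#   audio: 13 MFCC-like coefficients, visual: 10 AAM coefficients,
#   articulatory: 14 EMA coordinates (e.g. x/y positions of sensor coils).
T = 2000
audio_feats = rng.standard_normal((T, 13))
visual_feats = rng.standard_normal((T, 10))
ema_targets = rng.standard_normal((T, 14))

# Feature-level ("early") fusion: concatenate the audio and visual streams.
av_feats = np.hstack([audio_feats, visual_feats])

# CCA projects the audiovisual features and the articulatory trajectories onto
# maximally correlated canonical directions and yields a linear predictor of
# the articulatory targets from the fused features.
cca = CCA(n_components=6)
cca.fit(av_feats, ema_targets)
ema_pred = cca.predict(av_feats)

# Per-channel RMSE as a crude goodness-of-fit indicator.
rmse = np.sqrt(np.mean((ema_pred - ema_targets) ** 2, axis=0))
print("per-channel RMSE:", rmse)
```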