Robust multimodal identification systems based on audio-visual information have not yet been thoroughly investigated. The aim of this work is to propose a model-based feature extraction method that exploits the physiological characteristics of the facial muscles producing lip movements. The approach uses intrinsic muscle properties such as viscosity, elasticity, and mass, which are extracted from a dynamic lip model. Because these parameters depend exclusively on the neuromuscular properties of the speaker, imitation of valid speakers can be reduced to a large extent. The extracted parameters are applied to a Hidden Markov Model (HMM) audio-visual identification system. In this work, audio and video features are combined through a multistream pseudo-synchronized HMM training method. The proposed model is compared with other feature extraction methods, including Kalman filtering, neural networks, the adaptive network-based fuzzy inference system (ANFIS), and the autoregressive moving average (ARMA) model. The superior performance of the proposed system is demonstrated on a large multispeaker database of continuously spoken digits, along with a phonetically rich sentence. The combination of Kalman filtering and the proposed model leads to the best performance. The phonetic content of the pronounced sentences is also evaluated to determine the phonetic combinations that yield the best identification rate.
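As an illustrative sketch of the idea behind the feature extraction, a dynamic lip model driven by muscle mass, viscosity, and elasticity can be approximated by a second-order system, m·x″ + b·x′ + k·x = F(t), and the speaker-specific parameters (m, b, k) recovered from an observed lip-displacement trajectory by least squares. The equation form, signal names, and forcing term below are assumptions for demonstration, not the paper's actual formulation:

```python
import numpy as np

def estimate_muscle_params(x, force, dt):
    """Fit (m, b, k) so that m*x'' + b*x' + k*x ~= force, via least squares.

    x     : observed lip-displacement trajectory (1-D array)
    force : assumed muscle forcing signal (1-D array, same length)
    dt    : sampling interval in seconds
    """
    v = np.gradient(x, dt)             # numerical velocity
    a = np.gradient(v, dt)             # numerical acceleration
    A = np.column_stack([a, v, x])     # regressors for m, b, k
    params, *_ = np.linalg.lstsq(A, force, rcond=None)
    return params                      # array [m, b, k]

# Synthetic demonstration: build a trajectory from a known system,
# then recover its (hypothetical) neuromuscular parameters.
dt, n = 0.01, 500
t = np.arange(n) * dt
m_true, b_true, k_true = 1.0, 0.8, 4.0
# Two frequencies keep the regressor matrix well-conditioned.
x = np.sin(2 * np.pi * 0.5 * t) + 0.5 * np.sin(2 * np.pi * 1.3 * t)
v = np.gradient(x, dt)
a = np.gradient(v, dt)
force = m_true * a + b_true * v + k_true * x

m, b, k = estimate_muscle_params(x, force, dt)
print(m, b, k)
```

In an identification pipeline along the lines described above, the fitted (m, b, k) vector per lip segment would then serve as the visual feature stream fed to the HMM alongside the audio features.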