We describe an audio-visual automatic continuous speech recognition system that significantly improves speech recognition performance over a wide range of acoustic noise levels, as well as under clean audio conditions. The system uses the facial animation parameters (FAPs) supported by the MPEG-4 standard as the visual representation of speech. We also describe a robust, automatic algorithm we have developed to extract FAPs from visual data, which requires neither hand labeling nor extensive training procedures. Principal component analysis (PCA) was performed on the FAPs to reduce the dimensionality of the visual feature vectors, and the resulting projection weights were used as visual features in the audio-visual automatic speech recognition (ASR) experiments. Both single-stream and multistream hidden Markov models (HMMs) were used to model the ASR system, integrate the audio and visual information, and perform relatively large-vocabulary (approximately 1000 words) speech recognition experiments. The experiments use clean audio data and audio data corrupted by stationary white Gaussian noise at various SNRs. The proposed system reduces the word error rate (WER) by 20% to 23% relative to the audio-only WER at SNRs of 0-30 dB with additive white Gaussian noise, and by 19% relative to the audio-only WER under clean audio conditions.
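The PCA step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: per-frame FAP vectors are centered, a principal-component basis is computed via SVD, and the projection weights onto the leading components become the visual feature vectors. The function name, array shapes, and the choice of 3 components are illustrative assumptions.

```python
import numpy as np

def pca_project(fap_frames, n_components):
    """Project per-frame FAP vectors onto their top principal components.

    fap_frames: (n_frames, n_faps) array, one MPEG-4 FAP vector per video frame.
    Returns the (n_frames, n_components) projection weights used as visual
    features, plus the mean and basis needed to project new frames.
    """
    mean = fap_frames.mean(axis=0)
    centered = fap_frames - mean
    # SVD of the centered data; rows of vt are the principal directions,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]        # (n_components, n_faps)
    weights = centered @ basis.T     # per-frame projection weights
    return weights, mean, basis

# Illustrative usage: 100 frames of 10 FAPs reduced to 3-D visual features
rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 10))
weights, mu, basis = pca_project(frames, 3)
assert weights.shape == (100, 3)
```

A new frame would then be featurized as `(frame - mu) @ basis.T` before being passed to the audio-visual HMM.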