Visual recognition of speech consonants using facial movement features

Authors:
Wai Chee Yau; Dinesh Kant Kumar; Sridhar Poosapadi Arjunan
Affiliations:
School of Electrical and Computer Engineering, RMIT University, GPO Box 2476V Melbourne, Victoria 3001, Australia (all authors)
Corresponding author: Dinesh Kant Kumar. Tel.: +61399251954; E-mail: dinesh@rmit.edu.au
Venue:
Integrated Computer-Aided Engineering - Informatics in Control, Automation and Robotics
Year:
2007

Abstract

This paper presents a visual speech recognition technique based on video of facial movements. Because the acoustic signals of consonants are easily confused in noisy environments, the paper focuses on identifying consonants from visual information and investigates the feasibility of using facial movements to identify phonemes. The proposed approach adopts a visual speech model based on the viseme model of the Moving Picture Experts Group 4 (MPEG-4) standard. The system is movement-based: facial movements are segmented from the video using an accumulative image subtraction method that yields a 2-D grayscale motion history image (MHI). The MHI is represented by features computed with a combination of the discrete stationary wavelet transform (SWT) and image moments (Hu moments, geometric moments and Zernike moments). Feedforward multilayer perceptron (MLP) neural networks trained with the backpropagation (BPN) learning algorithm classify these features, which allows the performance of the three moment features to be compared. The experimental results indicate that Zernike moments have better representation ability and provide rotation invariance for the proposed application. The results also demonstrate that the proposed technique can identify consonants reliably using the viseme model of the MPEG-4 standard, with a recognition rate of 85%.
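The abstract does not give the exact formulas, so the following Python sketch only illustrates the general idea of an accumulative image subtraction that produces a grayscale MHI. It assumes one common formulation: consecutive frames are differenced, the difference is binarised, and each moving pixel keeps the (scaled) index of the latest frame in which it moved. The function names, the `threshold` value and the synthetic clip are hypothetical and are not taken from the paper. As a small companion, the sketch also computes the first Hu invariant, the simplest of the moment features named above, to show the kind of rotation insensitivity that moment-based features can offer. The SWT step, the Zernike moments and the MLP classifier are not covered here.

```python
import numpy as np


def motion_history_image(frames, threshold=0.1):
    """Collapse a clip into one grayscale motion history image (MHI).

    frames: array of shape (T, H, W) with intensities in [0, 1].
    threshold: hypothetical cut-off for deciding that a pixel has moved.
    """
    frames = np.asarray(frames, dtype=float)
    n_frames = frames.shape[0]
    if n_frames < 2:
        raise ValueError("at least two frames are needed")
    mhi = np.zeros(frames.shape[1:], dtype=float)
    for t in range(1, n_frames):
        # Binarised difference of consecutive frames marks where motion occurred.
        moved = np.abs(frames[t] - frames[t - 1]) > threshold
        # Keep the latest frame index at which each pixel moved,
        # so more recent movement ends up brighter.
        mhi = np.maximum(mhi, moved * t)
    # Scale so that motion in the final frame pair maps to 1.0.
    return mhi / (n_frames - 1)


def hu_first_invariant(image):
    """First Hu invariant, eta20 + eta02, of a grayscale image."""
    image = np.asarray(image, dtype=float)
    m00 = image.sum()
    ys, xs = np.mgrid[: image.shape[0], : image.shape[1]]
    # Centroid of the intensity distribution.
    xc = (xs * image).sum() / m00
    yc = (ys * image).sum() / m00
    # Second-order central moments.
    mu20 = ((xs - xc) ** 2 * image).sum()
    mu02 = ((ys - yc) ** 2 * image).sum()
    # Normalising by m00**2 gives eta20 + eta02.
    return (mu20 + mu02) / m00 ** 2


# Synthetic clip (an assumption for illustration): a bright block sliding right.
frames = np.zeros((5, 16, 16))
for t in range(5):
    frames[t, 6:10, 2 + 2 * t : 6 + 2 * t] = 1.0

mhi = motion_history_image(frames)
phi1 = hu_first_invariant(mhi)
phi1_rotated = hu_first_invariant(np.rot90(mhi))
```

In this sketch `phi1` and `phi1_rotated` agree up to floating-point error, because a 90-degree rotation only swaps the roles of the two second-order central moments. The whole clip is reduced to a single image, which is what lets static image descriptors such as moments be applied to a dynamic signal.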